Boldt-DC-1B German IT 16K

A 1.25B-parameter German instruction-tuned Llama model with a 16,384-token context window, extended from the base model's native 2,048-token context via YaRN RoPE scaling (factor 8.0).

The model was produced in two stages on top of Boldt/Boldt-DC-1B:

  1. Continued pretraining (CPT 16K) on long German documents (RAG contexts) to adapt the model to the extended context window.
  2. Supervised fine-tuning (SFT 16K) on curated German instruction data.

Both stages used Unsloth + TRL SFTTrainer with a LoRA adapter (r=64, α=64) over q/k/v/o/gate/up/down_proj modules; the released checkpoint is the merged base + adapter in bfloat16.
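
The adapter construction itself is not reproduced in this card; a minimal PEFT sketch consistent with the stated hyperparameters (dropout, bias, and task type are assumptions here, not stated above) would look like:

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # attention projections
        "gate_proj", "up_proj", "down_proj",       # MLP projections
    ],
    lora_dropout=0.0,    # assumption: dropout is not stated in this card
    bias="none",
    task_type="CAUSAL_LM",
)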

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mayflowergmbh/boldt-dc-1b-german-it-16k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Build a prompt with the model's four-token chat template (see below).
prompt = (
    "<|system|>\nDu bist ein präziser deutschsprachiger Assistent.\n"
    "<|user|>\nFasse die wichtigsten Punkte der DSGVO in drei Sätzen zusammen.\n"
    "<|assistant|>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
print(tokenizer.decode(out[0], skip_special_tokens=False))

The chat template uses four special tokens: <|system|>, <|user|>, <|assistant|>, <|end|>.
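
If the tokenizer ships this template in tokenizer_config.json (not verified here), the manual prompt above can also be produced with apply_chat_template, reusing the tokenizer from the quick start:

messages = [
    {"role": "system", "content": "Du bist ein präziser deutschsprachiger Assistent."},
    {"role": "user", "content": "Fasse die wichtigsten Punkte der DSGVO in drei Sätzen zusammen."},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)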

Architecture

Base model                     Boldt/Boldt-DC-1B (Llama, 1.25B parameters)
Hidden size / layers / heads   2,048 / 16 / 16
Intermediate size              8,192
Vocabulary                     32,004 (4 added chat tokens)
Native context                 2,048
Extended context               16,384 (YaRN, factor 8.0)
Released precision             bfloat16, merged single shard
Tensor format                  safetensors

The RoPE configuration in config.json:

"rope_parameters": {
  "rope_type": "yarn",
  "factor": 8.0,
  "original_max_position_embeddings": 2048,
  "rope_theta": 10000.0
}
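
To sanity-check the extension after download, the config can be inspected directly; depending on the Transformers version, the RoPE settings appear under rope_parameters or the older rope_scaling key:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mayflowergmbh/boldt-dc-1b-german-it-16k")
print(cfg.max_position_embeddings)   # expected: 16384
print(getattr(cfg, "rope_parameters", None) or getattr(cfg, "rope_scaling", None))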

Training

                             CPT 16K                        SFT 16K
Steps                        5,000                          7,000
Effective batch size         1 × 32 (grad-accum)            1 × 32 (grad-accum)
max_seq_length               16,384                         16,384
Packing                      enabled                        disabled
LR / schedule                7e-5, cosine, 3 % warmup       5e-5, cosine, 3 % warmup
LoRA r / α                   64 / 64                        64 / 64 (continued)
Target modules               q/k/v/o + gate/up/down         q/k/v/o + gate/up/down
Optimizer                    adamw_torch_fused              adamw_torch_fused
Gradient checkpointing       Unsloth                        Unsloth
Wall-clock (A6000, sm_8.6)   ~17 h (resumed)                ~32.8 h
Final eval loss              1.787                          1.117
Final train loss             1.751 (last 40-step window)    1.235
Epochs (over training data)  ~20.8                          ~3.2

The CPT run was OOM-killed once by the host at step 4480/5000 (swap pressure from unrelated services) and was resumed from checkpoint-4000 after pinning the trainer to a single A6000 with MALLOC_ARENA_MAX=2 and running it under nice -n 19 ionice -c 3. The loss curve is continuous across the resume: eval loss went 1.7878 (step 4000) → 1.7873 (step 4250) → 1.7871 (step 4500) → 1.7787 (step 5000), monotonically decreasing.
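
The trainer wiring is not reproduced in this card; a rough TRL sketch of the CPT stage, matching the table above and showing the checkpoint resume (model, tokenizer, and dataset construction are elided, and SFTConfig field names vary across TRL releases, e.g. max_seq_length was later renamed max_length):

from trl import SFTConfig, SFTTrainer

# Sketch only: `model` and `train_ds` come from earlier pipeline steps.
args = SFTConfig(
    output_dir="outputs/cpt-16k",        # hypothetical path
    max_steps=5000,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    learning_rate=7e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch_fused",
    max_seq_length=16384,
    packing=True,
)
trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds)
trainer.train(resume_from_checkpoint="outputs/cpt-16k/checkpoint-4000")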

Datasets

CPT and SFT data were assembled from publicly available German corpora. Each dataset retains its original license; users of this model must review the upstream licenses before commercial use.

Dataset                                   Stage       Rows ingested    Purpose
DiscoResearch/germanrag                   CPT + SFT   3,362 + 3,362    Long German RAG contexts
CausalLM/GPT-4-Self-Instruct-German       SFT         10,006           German self-instruct
seedboxai/multitask_german_examples_32k   SFT         50,000           German multitask instructions
maxidl/Capybara-de                        SFT         15,991           German conversational instructions
AgentWaller/german-oasst1-qa-format       SFT         9,843            OASST-style German QA
dennlinger/klexikon                       SFT         2,346            German simplification/summarization
flozi00/german-function-calling           SFT         1,327            German function-calling

After deduplication, packing, and length filtering, the processed splits used in training were:

  • CPT: 65,168 train / 3,501 validation rows
  • SFT: 70,029 train / 3,671 validation rows
  • Needle eval: 512 synthetic German long-context retrieval examples
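
The preprocessing code itself is not published; as a rough illustration of the length-filtering step, using the datasets library and the quick-start tokenizer (the column name is a placeholder, since each source has its own schema):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mayflowergmbh/boldt-dc-1b-german-it-16k")
ds = load_dataset("DiscoResearch/germanrag", split="train")  # one source from the table above

def fits_context(example):
    # "context" is a hypothetical field name for illustration
    return len(tokenizer(example["context"])["input_ids"]) <= 16384

ds = ds.filter(fits_context)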

Two sources were excluded: GEM/mlsum failed to load (its dataset scripts are deprecated in the datasets library), and PleIAs/German-PD was intentionally disabled in this run.
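
A defensive loading loop reproduces this skip-on-failure behavior; a minimal sketch (source list abbreviated):

from datasets import load_dataset

sources = [
    "DiscoResearch/germanrag",
    "GEM/mlsum",   # fails to load: deprecated dataset scripts
    # "PleIAs/German-PD" was disabled deliberately rather than by failure
]
loaded = {}
for name in sources:
    try:
        loaded[name] = load_dataset(name, split="train")
    except Exception as err:
        print(f"Skipping {name}: {err}")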

Evaluation

Long-context needle-in-haystack (German)

Five attempts per context length, with the needle injected at three different depths, across three context lengths (15 attempts in total). The model must recover a six-digit "geheime Prüfnummer" (secret check number) injected into a long German document.

Context length   Attempts   Successes   Success rate
4,096            5          5           100 %
8,192            5          5           100 %
16,384           5          5           100 %

This is the gate that validates the YaRN extension actually works in practice; the model retrieves the needle reliably at the full 16K context.
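
The eval harness itself is not included in this card; a single trial of this kind can be sketched as follows, reusing model and tokenizer from the quick start (the filler text, prompt wording, and token-level insertion scheme are illustrative assumptions):

import random

def needle_trial(model, tokenizer, context_tokens=16384, depth=0.5):
    """One retrieval attempt: hide a 6-digit number in German filler text."""
    secret = f"{random.randint(0, 999999):06d}"
    needle = f"Die geheime Prüfnummer lautet {secret}."
    filler = "Dies ist ein langer deutscher Beispieltext über viele Themen. " * 4000
    ids = tokenizer(filler, add_special_tokens=False)["input_ids"][: context_tokens - 256]
    pos = int(len(ids) * depth)  # insertion depth as a fraction of the haystack
    needle_ids = tokenizer(needle, add_special_tokens=False)["input_ids"]
    haystack = tokenizer.decode(ids[:pos] + needle_ids + ids[pos:])
    prompt = (
        "<|system|>\nDu bist ein präziser deutschsprachiger Assistent.\n"
        f"<|user|>\n{haystack}\n\nWie lautet die geheime Prüfnummer?\n"
        "<|assistant|>\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    answer = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return secret in answer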

Loss curves

SFT eval loss (evaluated every 250 steps; selected checkpoints shown)
 step 250  1.277
 step 500  1.256
 step 1000 1.247
 step 2500 1.195
 step 4000 1.128
 step 5000 1.120
 step 6000 1.118
 step 7000 1.117

Smooth, monotonic decrease; the last five evaluations are within ±0.001 of each other, i.e. the model is effectively converged on the SFT distribution.

Intended use and limitations

Intended use. German-language assistant tasks: summarization, question answering, instruction following, simple function-calling templates, and long-document Q&A up to 16K tokens.

Capability ceiling. This is a 1.25B-parameter model. Free-form generation can drift off-topic, hallucinate facts, or produce generic prose, especially on open-ended prompts. The strong long-context needle results show that the model can retrieve information injected into a long German document, not that it can reason over arbitrarily long inputs.

Languages. Trained primarily for German. The tokenizer is shared with the Llama base, so English works as a side effect, but English quality was not a goal of this project.

Safety. No RLHF or DPO alignment was applied. Outputs are not guaranteed to be safe, polite, or non-toxic. Apply your own safety layer before deploying.

Licensing. This card is released under Apache 2.0, but the model weights inherit constraints from the base model (Boldt/Boldt-DC-1B) and from each contributing dataset's license. Review those before commercial deployment.

How it was trained

The full pipeline (system check → install → smoke test → CPT 16K → SFT 16K → long-context eval → merge → final report) is reproducible from a Makefile. Key hardware: a single NVIDIA RTX A6000 (sm_8.6, 48 GB), CUDA 12.8 wheels, Unsloth 2026.5.2, Transformers 5.5.0, TRL 0.24.0, PEFT 0.19.1, PyTorch 2.10.0+cu128.

Acknowledgments

  • Base model: Boldt/Boldt-DC-1B.
  • Training stack: Unsloth, TRL, PEFT.
  • Dataset authors listed above — thank you for publishing high-quality German data.