Boldt-DC-1B German IT 16K
A 1.25B-parameter German instruction-tuned Llama model with a 16,384-token context window, extended from the base model's native 2,048-token context via YaRN RoPE scaling (factor 8.0).
The model was produced in two stages on top of Boldt/Boldt-DC-1B:
- Continued pretraining (CPT 16K) on long German documents (RAG contexts) to adapt the model to the extended context window.
- Supervised fine-tuning (SFT 16K) on curated German instruction data.
Both stages used Unsloth + TRL SFTTrainer with a LoRA adapter (r=64, α=64) over q/k/v/o/gate/up/down_proj modules; the released checkpoint is the merged base + adapter in bfloat16.
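A quick back-of-the-envelope check on what r=64 over those seven projection modules implies in trainable parameters. This sketch assumes full-size k/v projections (no grouped-query attention) and no adapter on embeddings or the LM head — assumptions, since the card does not state the KV head count:

```python
# LoRA adds A (d_in x r) and B (r x d_out) per adapted linear layer.
# Dimensions from the architecture table: hidden 2048, intermediate 8192, 16 layers.
r = 64
hidden, inter, layers = 2048, 8192, 16

def lora_params(d_in, d_out, r):
    return r * (d_in + d_out)

per_layer = (
    4 * lora_params(hidden, hidden, r)   # q/k/v/o_proj
    + 2 * lora_params(hidden, inter, r)  # gate/up_proj
    + lora_params(inter, hidden, r)      # down_proj
)
total = per_layer * layers
print(f"~{total / 1e6:.1f}M trainable LoRA parameters")  # ~48.2M
```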
Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "mayflowergmbh/boldt-dc-1b-german-it-16k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

prompt = (
    "<|system|>\nDu bist ein präziser deutschsprachiger Assistent.\n"
    "<|user|>\nFasse die wichtigsten Punkte der DSGVO in drei Sätzen zusammen.\n"
    "<|assistant|>\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
The chat template uses four special tokens: `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>`.
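For multi-turn use, a minimal prompt builder matching the format of the quick-start snippet might look like the sketch below. Note that the snippet above does not show where (or whether) `<|end|>` terminates each turn, so this sketch omits it — check the tokenizer's chat template (`tokenizer.apply_chat_template`) for the authoritative format:

```python
# Assemble a prompt from a list of {"role", "content"} messages,
# using the <|system|>/<|user|>/<|assistant|> tokens from the card.
def build_prompt(messages):
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    parts.append("<|assistant|>\n")  # generation prompt for the next turn
    return "".join(parts)

msgs = [
    {"role": "system", "content": "Du bist ein präziser deutschsprachiger Assistent."},
    {"role": "user", "content": "Fasse die DSGVO in drei Sätzen zusammen."},
]
print(build_prompt(msgs))
```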
Architecture
| Property | Value |
|---|---|
| Base model | Boldt/Boldt-DC-1B (Llama, 1.25B parameters) |
| Hidden size / layers / heads | 2,048 / 16 / 16 |
| Intermediate size | 8,192 |
| Vocabulary | 32,004 (4 added chat tokens) |
| Native context | 2,048 |
| Extended context | 16,384 (YaRN, factor 8.0) |
| Released precision | bfloat16, merged single shard |
| Tensor format | safetensors |
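The table's dimensions can be sanity-checked against the stated parameter count. This rough estimate assumes untied input/output embeddings and ignores norm weights — both assumptions, not stated on the card:

```python
# Rough dense-parameter estimate for a Llama-style model with the
# dimensions above; lands close to the stated 1.25B.
vocab, hidden, inter, layers = 32_004, 2048, 8192, 16

embed = vocab * hidden            # input embeddings
head = vocab * hidden             # lm_head (assuming untied)
attn = 4 * hidden * hidden        # q/k/v/o_proj
mlp = 3 * hidden * inter          # gate/up/down_proj
total = embed + head + layers * (attn + mlp)
print(f"~{total / 1e9:.2f}B parameters")  # ~1.20B
```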
The RoPE configuration in config.json:
"rope_parameters": {
"rope_type": "yarn",
"factor": 8.0,
"original_max_position_embeddings": 2048,
"rope_theta": 10000.0
}
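The extended window follows directly from these two values — the scaling factor multiplies the original position range:

```python
# 16,384-token window = native window x YaRN factor.
original_max_position_embeddings = 2048
factor = 8.0
extended = int(original_max_position_embeddings * factor)
print(extended)  # 16384
```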
Training
| | CPT 16K | SFT 16K |
|---|---|---|
| Steps | 5,000 | 7,000 |
| Effective batch size | 1 × 32 (grad-accum) | 1 × 32 (grad-accum) |
| max_seq_length | 16,384 | 16,384 |
| Packing | enabled | disabled |
| LR / schedule | 7e-5, cosine, 3 % warmup | 5e-5, cosine, 3 % warmup |
| LoRA r / α | 64 / 64 | 64 / 64 (continued) |
| Target modules | q/k/v/o + gate/up/down | q/k/v/o + gate/up/down |
| Optimizer | adamw_torch_fused | adamw_torch_fused |
| Gradient checkpointing | Unsloth | Unsloth |
| Wall-clock (A6000, sm_8.6) | ~17 h (resumed) | ~32.8 h |
| Final eval loss | 1.787 | 1.117 |
| Final train loss | 1.751 (last 40-step window) | 1.235 |
| Epochs (over training data) | ~20.8 | ~3.2 |
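From these hyperparameters one can bound the token budget per stage. The figure is exact only if every sequence is full-length — true for CPT (packing enabled), an upper bound for SFT (packing disabled, so shorter examples are padded or truncated):

```python
# Token budget per stage: steps x effective batch size x max_seq_length.
def token_budget(steps, batch, seq_len):
    return steps * batch * seq_len

cpt = token_budget(5_000, 32, 16_384)
sft = token_budget(7_000, 32, 16_384)
print(f"CPT ~{cpt / 1e9:.2f}B tokens, SFT <= {sft / 1e9:.2f}B tokens")
```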
The CPT run was OOM-killed once by the host at step 4,480/5,000 (swap pressure from unrelated services) and was resumed from checkpoint-4000 after pinning the trainer to a single A6000 with `MALLOC_ARENA_MAX=2` and `nice -n 19 ionice -c 3`. The loss curve is continuous across the resume: eval loss decreased monotonically from 1.7878 (step 4,000) to 1.7873 (4,250), 1.7871 (4,500), and 1.7787 (5,000).
Datasets
CPT and SFT data were assembled from publicly available German corpora. Each dataset retains its original license; users of this model must review the upstream licenses before commercial use.
| Dataset | Stage | Rows ingested | Purpose |
|---|---|---|---|
| DiscoResearch/germanrag | CPT + SFT | 3,362 + 3,362 | Long German RAG contexts |
| CausalLM/GPT-4-Self-Instruct-German | SFT | 10,006 | German self-instruct |
| seedboxai/multitask_german_examples_32k | SFT | 50,000 | German multitask instructions |
| maxidl/Capybara-de | SFT | 15,991 | German conversational instructions |
| AgentWaller/german-oasst1-qa-format | SFT | 9,843 | OASST-style German QA |
| dennlinger/klexikon | SFT | 2,346 | German simplification/summarization |
| flozi00/german-function-calling | SFT | 1,327 | German function-calling |
After deduplication, packing, and length filtering, the processed splits used in training were:
- CPT: 65,168 train / 3,501 validation rows
- SFT: 70,029 train / 3,671 validation rows
- Needle eval: 512 synthetic German long-context retrieval examples
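The exact preprocessing code is not published; a hypothetical sketch of the length-filtering step might look like this, where `count_tokens` stands in for a real tokenizer call (`tokenizer(text)["input_ids"]`) and the 4-chars-per-token heuristic is an assumption for illustration only:

```python
# Keep only rows whose token count fits the 16,384-token window.
MAX_LEN = 16_384

def count_tokens(text):
    # Placeholder heuristic: ~4 characters per token.
    return max(1, len(text) // 4)

def length_filter(rows, max_len=MAX_LEN):
    return [r for r in rows if count_tokens(r["text"]) <= max_len]

rows = [{"text": "kurzer Beispieltext"}, {"text": "x" * 100_000}]
print(len(length_filter(rows)))  # 1 -- the oversized row is dropped
```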
Sources excluded because they failed to load: GEM/mlsum (its dataset scripts are deprecated in the `datasets` library) and PleIAs/German-PD (intentionally disabled in this run).
Evaluation
Long-context needle-in-haystack (German)
Five attempts per context length, across three needle depths and three context lengths. The model must recover a 6-digit "geheime Prüfnummer" (secret check number) injected into a long German document.
| Context length | Attempts | Successes | Success rate |
|---|---|---|---|
| 4,096 | 5 | 5 | 100 % |
| 8,192 | 5 | 5 | 100 % |
| 16,384 | 5 | 5 | 100 % |
This is the gate that validates the YaRN extension actually works in practice; the model retrieves the needle reliably at the full 16K context.
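The 512 synthetic examples themselves are not published; a sketch of how such an example can be constructed is shown below. The filler text and injection scheme are illustrative assumptions, not the actual eval generator:

```python
# Inject a 6-digit "geheime Prüfnummer" at a chosen relative depth into
# German filler text; a model passes if its answer contains the number.
import random

def make_needle_example(context_chars, depth, seed=0):
    rng = random.Random(seed)
    number = f"{rng.randint(0, 999_999):06d}"
    needle = f" Die geheime Prüfnummer lautet {number}. "
    sentence = "Dies ist ein langer deutscher Beispieltext. "
    filler = sentence * (context_chars // len(sentence) + 1)
    pos = int(len(filler) * depth)  # depth in [0, 1]
    haystack = filler[:pos] + needle + filler[pos:]
    return haystack, number

haystack, number = make_needle_example(context_chars=50_000, depth=0.5)
print(number in haystack)  # True
```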
Loss curves
SFT eval loss (evaluated every 250 steps; selected points shown)

| Step | Eval loss |
|---|---|
| 250 | 1.277 |
| 500 | 1.256 |
| 1,000 | 1.247 |
| 2,500 | 1.195 |
| 4,000 | 1.128 |
| 5,000 | 1.120 |
| 6,000 | 1.118 |
| 7,000 | 1.117 |
A smooth, monotonic decrease; the final evaluations differ by only a few thousandths, indicating the model has converged on the SFT distribution.
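The convergence claim can be read off directly from the per-interval improvements in the trace above, which shrink to a few thousandths by the end of training:

```python
# Consecutive eval-loss deltas from the SFT trace above.
evals = [(250, 1.277), (500, 1.256), (1000, 1.247), (2500, 1.195),
         (4000, 1.128), (5000, 1.120), (6000, 1.118), (7000, 1.117)]
deltas = [round(a[1] - b[1], 3) for a, b in zip(evals, evals[1:])]
print(deltas)  # [0.021, 0.009, 0.052, 0.067, 0.008, 0.002, 0.001]
```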
Intended use and limitations
Intended use. German-language assistant tasks: summarization, question answering, instruction following, simple function-calling templates, and long-document Q&A up to 16K tokens.
Capability ceiling. This is a 1.25B-parameter model. Free-form generation can drift off-topic, hallucinate facts, or produce generic prose, especially on open-ended prompts. The strong long-context needle results indicate the model can retrieve information injected into a long German document, not that it can reason over arbitrarily long inputs.
Languages. Trained primarily for German. The tokenizer is shared with the Llama base, so English is supported as a side effect but quality is not the project's target.
Safety. No RLHF or DPO alignment was applied. Outputs are not guaranteed to be safe, polite, or non-toxic. Apply your own safety layer before deploying.
Licensing. This card is released under Apache 2.0, but the model weights inherit constraints from the base model (Boldt/Boldt-DC-1B) and from each contributing dataset's license. Review those before commercial deployment.
How it was trained
The full pipeline (system check → install → smoke test → CPT 16K → SFT 16K → long-context eval → merge → final report) is reproducible from a Makefile. Key hardware: a single NVIDIA RTX A6000 (sm_8.6, 48 GB), CUDA 12.8 wheels, Unsloth 2026.5.2, Transformers 5.5.0, TRL 0.24.0, PEFT 0.19.1, PyTorch 2.10.0+cu128.
Acknowledgments
- Base model: Boldt/Boldt-DC-1B.
- Training stack: Unsloth, TRL, PEFT.
- Dataset authors listed above — thank you for publishing high-quality German data.