# ModernBERT Dutch Base Wide
A ModernBERT model pretrained on the Dutch `mc4_nl_cleaned` dataset. It has 22 layers (like ModernBERT-base) but a wider hidden dimension (1024 instead of 768), placing it between base and large at 230M parameters.
## Model Details
- Architecture: ModernBERT (Answer.AI/LightOn)
- Layers: 22
- Hidden size: 1024
- Attention heads: 16
- Intermediate size: 1536
- Vocab size: 32,128
- Parameters: 230M
- Tokenizer: `yhavinga/dutch-llama-tokenizer` (SentencePiece, Dutch-optimized)
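As a sanity check, the 230M figure can be reproduced from the dimensions listed above. The back-of-envelope count below assumes ModernBERT's GeGLU feed-forward (the up-projection outputs `2 * intermediate_size`) and ignores layer norms and other small terms:

```python
# Rough parameter count from the dimensions in the Model Details list.
# Assumes a GeGLU feed-forward as in ModernBERT; norms and other small
# terms are ignored, so the result is slightly below the reported 230M.
vocab, hidden, layers, intermediate = 32_128, 1024, 22, 1536

embedding = vocab * hidden       # token embeddings
attention = 4 * hidden * hidden  # Q, K, V, O projections
mlp = hidden * (2 * intermediate) + intermediate * hidden  # GeGLU Wi + Wo

total = embedding + layers * (attention + mlp)
print(f"{total / 1e6:.0f}M parameters")  # ≈ 229M, consistent with ~230M
```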
## Training
- Dataset: `yhavinga/mc4_nl_cleaned` (full config)
- Steps: 2,000,000
- Batch size: 8 per device (multi-host TPU v4)
- Learning rate: 3e-5 with cosine decay to 1e-6
- Warmup steps: 20,000
- Weight decay: 0.01
- Sequence length: 1024
- Precision: bfloat16
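The schedule above can be sketched as a small function. Only the peak (3e-5), floor (1e-6), warmup length (20,000 steps), total steps (2,000,000), and cosine decay are stated in this card; the linear warmup shape is an assumption:

```python
import math

def learning_rate(step, peak=3e-5, floor=1e-6, warmup=20_000, total=2_000_000):
    """Cosine decay from `peak` to `floor`, after an (assumed linear) warmup."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 at end of warmup, 1 at end
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(learning_rate(20_000))     # peak, ~3e-5
print(learning_rate(2_000_000))  # floor, ~1e-6
```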
## Usage
```python
from transformers import AutoTokenizer, ModernBertForMaskedLM
import torch

model = ModernBertForMaskedLM.from_pretrained("yhavinga/dmbert-dutchl-1024h-22l-2000000")
tokenizer = AutoTokenizer.from_pretrained("yhavinga/dmbert-dutchl-1024h-22l-2000000")

# Masked language modeling
text = f"Amsterdam is de {tokenizer.mask_token} van Nederland."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Find the mask position and decode the top prediction
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
prediction = tokenizer.decode(outputs.logits[0, mask_pos].argmax(dim=-1))
# Expected: "hoofdstad" (capital)
```
## Model Architecture Differences
This model (1024h-22L-2) differs from the earlier 1024h-22L variant:
| Parameter | 1024h-22L | 1024h-22L-2 (this model) |
|---|---|---|
| `intermediate_size` | 4096 | 1536 |
| `tokenizer` | `jhu-clsp/mmBERT-small` | `yhavinga/dutch-llama-tokenizer` |
| `vocab_size` | 256,000 | 32,128 |
The smaller intermediate MLP size and Dutch-specific tokenizer make this model more efficient while maintaining strong Dutch language understanding.
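To make the efficiency claim concrete, the parameter savings can be estimated from the table above. This rough comparison assumes both variants use ModernBERT's GeGLU feed-forward:

```python
hidden, layers = 1024, 22

def mlp_params(intermediate):
    # GeGLU feed-forward: up-projection to 2*intermediate, down-projection back
    return hidden * (2 * intermediate) + intermediate * hidden

old_mlp = layers * mlp_params(4096)  # 1024h-22L
new_mlp = layers * mlp_params(1536)  # 1024h-22L-2 (this model)
old_emb = 256_000 * hidden           # mmBERT-small vocabulary
new_emb = 32_128 * hidden            # dutch-llama-tokenizer vocabulary

print(f"MLP:        {old_mlp / 1e6:.0f}M -> {new_mlp / 1e6:.0f}M")
print(f"Embeddings: {old_emb / 1e6:.0f}M -> {new_emb / 1e6:.0f}M")
```

The embedding table shrinks by roughly 230M parameters alone, which is where most of the size difference between the two variants comes from.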
## Citation
If you use this model, please cite:
```bibtex
@misc{dmbert_dutchl_1024h,
  title={Dutch ModernBERT 1024h-22L},
  author={Yeb Havinga},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/yhavinga/dmbert-dutchl-1024h-22l-2000000}
}
```