fiNERweb Collection A multilingual dataset for NER covering 91 languages and 25 scripts • 3 items • Updated Dec 16, 2025 • 3
F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World Paper • 2603.19223 • Published 11 days ago • 30
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano and Super v3. • 28 items • Updated 6 days ago • 104
Nemotron-Cascade 2 Collection Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation • 4 items • Updated 6 days ago • 41
Omnilingual MT: Machine Translation for 1,600 Languages Paper • 2603.16309 • Published 14 days ago • 20
Article Efficient LLM Pretraining: Packed Sequences and Masked Attention Oct 7, 2024 • 69
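The packing idea in that article is easy to show in code: concatenate several documents into one training row, then use a block-diagonal causal mask so tokens never attend across document boundaries. A minimal sketch of that idea (not the article's code; `packed_attention_mask` is a hypothetical helper):

```python
# Sketch of sequence packing with a block-diagonal causal attention
# mask: tokens from different documents packed into one row cannot
# attend to each other. Illustrative only.
import torch

def packed_attention_mask(seq_lens: list[int]) -> torch.Tensor:
    """Build a (T, T) boolean mask for documents of the given lengths
    packed back-to-back into one sequence of length T = sum(seq_lens).
    True = attention allowed (causal, within the same document)."""
    T = sum(seq_lens)
    doc_ids = torch.repeat_interleave(
        torch.arange(len(seq_lens)), torch.tensor(seq_lens)
    )  # (T,) document index of each token
    same_doc = doc_ids[:, None] == doc_ids[None, :]        # block-diagonal
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return same_doc & causal

# Two documents of lengths 3 and 2 packed into one length-5 row:
mask = packed_attention_mask([3, 2])
# mask[3, 2] is False: the first token of doc 2 cannot see doc 1.
```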
Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA Paper • 2603.14782 • Published 15 days ago • 1
Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike Paper • 2603.15130 • Published 15 days ago • 1
Article Ulysses Sequence Parallelism: Training with Million-Token Contexts 22 days ago • 24
Article FlashHead: Accelerating Language Model Inference *An efficient drop-in replacement for the classification head* 19 days ago • 2
Flash-KMeans: Fast and Memory-Efficient Exact K-Means Paper • 2603.09229 • Published 21 days ago • 81
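One standard way to make exact k-means memory-efficient is to chunk the assignment step and expand ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, so the full N × K distance matrix is never materialized. The sketch below shows that generic trick under assumed shapes; it is not the specific Flash-KMeans algorithm from the paper:

```python
# One memory-efficient *exact* k-means assignment step: distances are
# computed chunk by chunk, and the per-row constant ||x||^2 is dropped
# since it does not affect the argmin. Generic trick, not Flash-KMeans.
import torch

def assign_exact(x: torch.Tensor, centroids: torch.Tensor,
                 chunk: int = 4096) -> torch.Tensor:
    """x: (N, D) points, centroids: (K, D). Returns (N,) labels."""
    c_sq = (centroids ** 2).sum(dim=1)             # (K,) squared norms
    labels = torch.empty(x.shape[0], dtype=torch.long)
    for start in range(0, x.shape[0], chunk):
        xc = x[start:start + chunk]                # (B, D) chunk of points
        d = c_sq - 2.0 * (xc @ centroids.T)        # (B, K), exact up to ||x||^2
        labels[start:start + chunk] = d.argmin(dim=1)
    return labels

labels = assign_exact(torch.randn(10_000, 64), torch.randn(256, 64))
```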
Nemotron-Pre-Training-Datasets Collection Large-scale pre-training datasets used in the Nemotron family of models. • 12 items • Updated 6 days ago • 133
Lost in Backpropagation: The LM Head is a Gradient Bottleneck Paper • 2603.10145 • Published 20 days ago • 11
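A quick back-of-envelope calculation shows why the LM head can bottleneck backpropagation: the logits tensor and its gradient both scale with vocabulary size, dwarfing a layer's hidden-state activations. The numbers below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical config: the logits and their gradient alone far exceed
# one layer's hidden states, so the vocab projection dominates memory.
batch, seq, hidden, vocab = 8, 4096, 4096, 128_000

logits_elems = batch * seq * vocab           # forward logits
grad_elems = logits_elems                    # dL/dlogits, same shape
bytes_fp32 = 4 * (logits_elems + grad_elems)
print(f"logits + grad: {bytes_fp32 / 2**30:.1f} GiB in fp32")
# ~31.3 GiB here, versus ~0.5 GiB for one layer's hidden states
# (batch * seq * hidden elements in fp32).
```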
NVIDIA Nemotron v3 Collection Open, Production-ready Enterprise Models • 15 items • Updated 6 days ago • 248
MixtureVitae study models and datasets Collection Collection of models and datasets related to MixtureVitae, an open and fully reproducible pretraining dataset built from permissive sources • 16 items • Updated Feb 13 • 2
Article Scaling Pedagogical Pre-training: From Optimal Mixing to 10 Billion Tokens 25 days ago • 4