arxiv:2603.26164

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models

Published on Mar 27

Submitted by bohan zeng on Apr 3

Authors:

AI-generated summary

DataFlex is a unified framework for dynamic data-centric training of large language models that supports sample selection, domain mixture adjustment, and sample reweighting while maintaining compatibility with standard training workflows and enabling efficient large-scale deployment.

Abstract

Data-centric training has emerged as a promising direction for improving large language models (LLMs) by optimizing not only model parameters but also the selection, composition, and weighting of training data during optimization. However, existing approaches to data selection, data mixture optimization, and data reweighting are often developed in isolated codebases with inconsistent interfaces, hindering reproducibility, fair comparison, and practical integration. In this paper, we present DataFlex, a unified data-centric dynamic training framework built upon LLaMA-Factory. DataFlex supports three major paradigms of dynamic data optimization: sample selection, domain mixture adjustment, and sample reweighting, while remaining fully compatible with the original training workflow. It provides extensible trainer abstractions and modular components, enabling a drop-in replacement for standard LLM training, and unifies key model-dependent operations such as embedding extraction, inference, and gradient computation, with support for large-scale settings including DeepSpeed ZeRO-3. We conduct comprehensive experiments across multiple data-centric methods. Dynamic data selection consistently outperforms static full-data training on MMLU across both Mistral-7B and Llama-3.2-3B. For data mixture, DoReMi and ODM improve both MMLU accuracy and corpus-level perplexity over default proportions when pretraining Qwen2.5-1.5B on SlimPajama at 6B and 30B token scales. DataFlex also achieves consistent runtime improvements over original implementations. These results demonstrate that DataFlex provides an effective, efficient, and reproducible infrastructure for data-centric dynamic training of LLMs.
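The three paradigms named in the abstract can be illustrated with a minimal sketch. This is not DataFlex's API: the function names `select_top_k`, `update_mixture`, and `reweighted_loss` are hypothetical, the scores and losses are toy inputs, and the mixture update is a generic multiplicative-weights rule rather than the DoReMi or ODM algorithm.

```python
import math


def select_top_k(scores, k):
    """Sample selection: return indices of the k highest-scoring samples."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def update_mixture(weights, domain_losses, lr=1.0):
    """Domain mixture adjustment: upweight domains with higher loss.

    Generic multiplicative-weights update followed by renormalization,
    so the returned proportions still sum to 1.
    """
    raw = [w * math.exp(lr * loss) for w, loss in zip(weights, domain_losses)]
    total = sum(raw)
    return [r / total for r in raw]


def reweighted_loss(losses, temperature=1.0):
    """Sample reweighting: softmax weights over per-sample losses.

    Higher-loss samples receive larger weights; returns the weighted mean.
    """
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    norm = sum(exps)
    return sum((e / norm) * loss for e, loss in zip(exps, losses))


if __name__ == "__main__":
    print(select_top_k([0.1, 0.9, 0.5, 0.7], k=2))
    print(update_mixture([0.5, 0.5], [1.0, 2.0]))
    print(reweighted_loss([1.0, 3.0]))
```

In a dynamic training setup these quantities would be recomputed periodically from the current model state (for example from gradients or per-domain validation loss) rather than fixed once before training, which is the distinction the abstract draws against static full-data training.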


Community

zbhpku

Paper submitter about 6 hours ago

DataFlex is a data-centric training framework that improves model performance by selecting the most influential samples, reweighting samples, or adjusting domain mixing ratios.

MeiyiQiang

about 3 hours ago

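DataFlex is a unified data-centric dynamic training framework for large models, jointly developed by the PKU DCAI lab and the LLaMA-Factory team. It supports three core capabilities in one place: data selection, data mixing, and sample reweighting. It is fully compatible with the native training workflow and supports large-scale training with DeepSpeed ZeRO-3, and it can substantially improve experimental reproducibility and model performance. It is practical for both research and real-world development. Feel free to reach out and discuss!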