Megatron-SWIFT DPO training on Qwen/Qwen3-Next-80B-A3B-Instruct always returns NaN loss. Why?
#45 by tangyiming - opened
I train on 6 × 8 GPUs, and below is my training script:
megatron rlhf \
--rlhf_type dpo \
--model input_model/Qwen3-Next-80B-A3B-Instruct \
--dataset data/test_dpo.jsonl \
--save saves_local \
--load_from_cache_file false \
--model_type qwen3_next \
--template qwen3_nothinking \
--torch_dtype bfloat16 \
--load_safetensors true \
--save_safetensors true \
--check_model false \
--split_dataset_ratio 0 \
--train_type full \
--moe_permute_fusion true \
--tensor_model_parallel_size 2 \
--pipeline_model_parallel_size 3 \
--expert_model_parallel_size 8 \
--expert_tensor_parallel_size 1 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 1e-3 \
--micro_batch_size 1 \
--global_batch_size 48 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-6 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--save_interval 100 \
--eval_interval 10 \
--log_interval 1 \
--packing false \
--max_length 16384 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--optimizer_cpu_offload true \
--use_precision_aware_optimizer true \
--optimizer_offload_fraction 1 \
--beta 0.1 \
--rpo_alpha 0.1 \
--loss_type sigmoid
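
For context, with --loss_type sigmoid, --beta 0.1 and --rpo_alpha 0.1 the preference loss should amount to the standard sigmoid DPO loss plus an NLL term on the chosen response scaled by rpo_alpha. Below is a minimal PyTorch sketch of that math (my own function and argument names, not the ms-swift/Megatron code path); since -logsigmoid is numerically stable, a NaN at this stage usually means the policy or reference log-probabilities feeding into it are already NaN/inf:

import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     beta=0.1, rpo_alpha=0.1, chosen_nll=None):
    # Illustrative sketch only -- not the actual ms-swift implementation.
    # Policy-vs-reference log-ratios for chosen and rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # --loss_type sigmoid: -log sigmoid(beta * (chosen - rejected) margin)
    logits = beta * (chosen_logratios - rejected_logratios)
    losses = -F.logsigmoid(logits)

    # --rpo_alpha: add the scaled NLL of the chosen response (RPO term).
    if chosen_nll is not None:
        losses = losses + rpo_alpha * chosen_nll

    # If this mean is NaN, the inputs above almost certainly already were.
    return losses.mean()

# Hypothetical sequence-level log-probs, just to show the call shape.
p_c = torch.tensor([-120.0]); p_r = torch.tensor([-150.0])
r_c = torch.tensor([-118.0]); r_r = torch.tensor([-149.0])
print(dpo_sigmoid_loss(p_c, p_r, r_c, r_r, chosen_nll=torch.tensor([2.3])))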