BTW, I think this should also work, but it contradicts the validation `num_process == dp_replicate_size * dp_shard_size * tp_size * cp_size * sp_size`. From my understanding, sharding the model with `dp_shard_size` should never be affected by the `non_data_parallel_size`. Thanks!
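To make the contradiction concrete, here is a small sketch of the validation as I understand it from the error. The function name and defaults are my own, not copied from accelerate's source:

```python
# Hypothetical sketch of the validation described above (my own naming,
# not accelerate's actual code).
def validate(num_process, dp_replicate_size, dp_shard_size,
             tp_size=1, cp_size=1, sp_size=1):
    expected = dp_replicate_size * dp_shard_size * tp_size * cp_size * sp_size
    return num_process == expected

# 256 ranks with cp_size=4 leaves only 64 ranks for
# dp_replicate_size * dp_shard_size, so dp_shard_size is constrained
# by the non-data parallelisms:
print(validate(256, 8, 8, cp_size=4))    # passes: 8 * 8 * 4 == 256
print(validate(256, 32, 8, cp_size=4))   # fails:  32 * 8 * 4 == 1024
```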
Thanks for the reply. I am still confused about the calculation.
From my understanding now, the non-data parallelisms are expected to take the same batched data (that is, to process data from the same batch). I agree with total_size = data_parallel_size * non_data_parallel_size. But since dp_shard_size effectively stands for the number of distinct batches of data in one iteration (as described here), it should equal data_parallel_size, so what does dp_replicate_size stand for?
Take an example. My use case is 256 ranks with FSDP2 and CP. I want the model sharded across groups of 8 ranks (giving 32 replicas), with cp_size=4. From the docs, I should set dp_shard_size=8, cp_size=4, and maybe dp_replicate_size=8? I believe that by setting cp_size=4, I should have 256/4=64 distinct batches in each iteration. The numbers work out if we use data_parallel_size = dp_shard_size * dp_replicate_size, but I cannot see what dp_replicate_size=8 means: it is neither the number of model replicas (which is 32) nor the replication factor of each batch (which is 4).
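The arithmetic for my example above, written out in plain Python (just numbers, not accelerate code), shows why dp_replicate_size=8 does not match either quantity I care about:

```python
# Worked numbers for the 256-rank FSDP2 + CP example above.
world_size = 256
cp_size = 4
dp_shard_size = 8  # model sharded over groups of 8 ranks

# Under world_size == dp_replicate_size * dp_shard_size * cp_size,
# the remaining factor is forced to be:
dp_replicate_size = world_size // (dp_shard_size * cp_size)  # 256 // 32 = 8

# But the quantities I actually reason about are:
model_replicas = world_size // dp_shard_size   # 32 copies of the sharded model
distinct_batches = world_size // cp_size       # 64 distinct batches per iteration

print(dp_replicate_size, model_replicas, distinct_batches)  # 8 32 64
# dp_replicate_size (8) matches neither 32 nor 64, which is my confusion.
```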
May I ask whether the current implementation in accelerate is correct? I think it should be world_size = dp_replicate_size * dp_shard_size, since dp and the other parallelisms are on different dimensions. Thanks!
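In other words, the shape I would have expected is the following sketch (my own reading, which may well be wrong), where dp_replicate_size is simply the model-replica count and the other parallelisms live on separate dimensions:

```python
# Sketch of the alternative I am proposing (my own naming and reading,
# not accelerate's behavior): dp_replicate_size counts model replicas.
world_size = 256
dp_shard_size = 8

# Then dp_replicate_size would just be the number of model replicas:
dp_replicate_size = world_size // dp_shard_size  # 32
print(world_size == dp_replicate_size * dp_shard_size)  # True
```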