kshitijthakkar's Collections

Large MoE Architecture Search (1B-2B)

Systematic architecture search over MoE models in the 1B-2B parameter range. Best configuration: batch size 1 with a 2048-token context, reaching a loss of 0.32. Top-8 routing outperforms top-2 routing.
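For readers unfamiliar with the top-2 versus top-8 comparison, here is a minimal sketch of top-k expert routing for a single token. The collection does not describe its router, so everything here is an assumption for illustration: the function name `top_k_route`, the ten example logits, and the choice to apply a softmax over only the selected logits (one common convention, not necessarily the one used in these models).

```python
import math

def top_k_route(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Hypothetical helper: normalization over the selected logits is an
    assumption, not something the collection specifies.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest router logits, highest first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Made-up router logits for one token over 10 experts.
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 1.0, 0.3, -2.0]

top2 = top_k_route(router_logits, k=2)  # token is sent to 2 experts
top8 = top_k_route(router_logits, k=8)  # token is sent to 8 experts
```

With a larger k, each token's output mixes more experts at a higher compute cost per token, which is the trade-off the top-8 versus top-2 comparison explores.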