On-Policy Self-Distillation for Reasoning Compression
Paper • 2603.05433 • Published • 6
Hi @sseymens
Thank you for your comments.
I can help answer your question about the MoE on-policy part.
Setting old_log_prob = log_prob.detach() does not by itself solve the on-policy issue: the probability is computed with the current policy, but the sampling distribution can still differ because of expert selection. The substitution will alleviate the issue if this is the root cause, so it is meant only as a hypothesis test.
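To make the point above concrete, here is a minimal sketch in plain Python (no PyTorch, so the detach is only imitated by reusing the same number). Everything in it is a made-up illustration rather than the paper's code: the two-expert toy model, the logits, and the names `EXPERT_LOGITS`, `rollout_expert`, and `trainer_expert` are assumptions. It shows that the ratio built from the replaced old_log_prob is 1 by construction, while the ratio against the distribution that actually sampled the token is not.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy MoE: each expert produces different logits
# over a 3-token vocabulary (values are made up for illustration).
EXPERT_LOGITS = {
    0: [2.0, 0.5, 0.1],
    1: [0.3, 1.8, 0.4],
}


def token_log_prob(expert_id, token):
    """Log-probability of `token` when the router picks `expert_id`."""
    return math.log(softmax(EXPERT_LOGITS[expert_id])[token])


token = 0
rollout_expert = 0  # assumed: expert selected when the token was sampled
trainer_expert = 1  # assumed: expert selected in the training forward pass

# Log-prob under the distribution that actually generated the token.
sampling_log_prob = token_log_prob(rollout_expert, token)
# Log-prob recomputed by the current policy during training.
log_prob = token_log_prob(trainer_expert, token)

# Stand-in for old_log_prob = log_prob.detach(): same value, no gradient.
old_log_prob = log_prob

# Ratio seen by the loss after the replacement: 1 by construction.
ratio_detach = math.exp(log_prob - old_log_prob)
# Ratio against the real sampling distribution: differs from 1.
ratio_true = math.exp(log_prob - sampling_log_prob)

print(f"ratio with detached old_log_prob: {ratio_detach:.3f}")
print(f"ratio vs. sampling distribution:  {ratio_true:.3f}")
```

If training stabilizes after the replacement, that would support the hypothesis that the mismatch in the ratio is the cause, but the gap between the sampler and the trainer in this sketch is still there.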