Hejian Sang

pb09204048

Recent Activity

upvoted a paper 15 days ago

On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published 16 days ago • 6

submitted a paper to Daily Papers 15 days ago

On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published 16 days ago • 6

authored a paper 18 days ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published 24 days ago • 6

upvoted a paper 21 days ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published 24 days ago • 6

submitted a paper to Daily Papers 22 days ago

Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning

Paper • 2602.21420 • Published 24 days ago • 6

commented on a paper 26 days ago

Reinforcement Learning via Self-Distillation

Paper • 2601.20802 • Published Jan 28 • 42

commented on a paper about 1 month ago

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Paper • 2601.18734 • Published Jan 26 • 3

commented on Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective about 2 months ago

Hi @sseymens,
Thank you for your comments.
I can help answer your question about the MoE on-policy part.

Yes, forcing old_log_prob = log_prob.detach() does not solve the on-policy issue: the probability is computed under the current policy, but the sampling distribution can still differ because of expert selection.
When we explored the agentic issues in gpt-oss training, we did not identify the root cause at the beginning. One hypothesis was inference-training inconsistency. Applying importance sampling did not help, so we tested whether forcing old_log_prob = log_prob.detach() would alleviate the issue if that were the root cause. This was purely for hypothesis testing.
At that time, verl did not yet support expert router replay (https://arxiv.org/pdf/2510.11370v1), so we could not test that idea. We have since tested the replay, and it is not the root cause either. The root cause is the attention sink.
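
To illustrate the first point of the comment, here is a minimal sketch in plain Python. It is not code from verl or from the article, and all probabilities and rewards are hypothetical toy values. It assumes a two-token vocabulary in which the rollout engine samples from one distribution (`q_rollout`) while the training forward pass assigns different probabilities (`p_train`), as could happen when different experts are selected. Setting the old log-probability equal to the current one pins the policy ratio at 1, but the samples still come from `q_rollout`, so the resulting estimate is not the on-policy one.

```python
import math

# Hypothetical two-token example; none of these numbers come from the article.
q_rollout = [0.30, 0.70]  # distribution the inference engine actually sampled from
p_train = [0.20, 0.80]    # distribution the training forward pass computes
reward = [1.0, 0.0]       # toy per-token reward


def policy_ratio(log_prob: float, old_log_prob: float) -> float:
    """Standard PPO-style ratio exp(log_prob - old_log_prob)."""
    return math.exp(log_prob - old_log_prob)


# Forcing old_log_prob = log_prob (the detach trick): the ratio is 1 for every token.
forced_ratios = [policy_ratio(math.log(p), math.log(p)) for p in p_train]

# Expected reward estimate when tokens are drawn from q_rollout but weighted by ratio 1.
forced_estimate = sum(q * w * r for q, w, r in zip(q_rollout, forced_ratios, reward))

# True on-policy expectation under the training-side distribution.
on_policy_value = sum(p * r for p, r in zip(p_train, reward))

# Importance weights p_train / q_rollout would correct this toy mismatch.
is_weights = [p / q for p, q in zip(p_train, q_rollout)]
is_estimate = sum(q * w * r for q, w, r in zip(q_rollout, is_weights, reward))

print(forced_ratios, forced_estimate, on_policy_value, is_estimate)
```

In this toy setting the forced-ratio estimate stays tied to the sampling distribution, while the importance-weighted estimate matches the on-policy value. As the comment notes, in the actual gpt-oss experiments neither importance sampling nor the forced ratio fixed the problem, because the root cause lay elsewhere.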

published an article about 2 months ago

Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Article • Jan 27 • 67

authored a paper 2 months ago

Debunk the Myth of SFT Generalization

Paper • 2510.00237 • Published Sep 30, 2025 • 2

upvoted 2 papers 6 months ago

Debunk the Myth of SFT Generalization

Paper • 2510.00237 • Published Sep 30, 2025 • 2

Planner-R1: Reward Shaping Enables Efficient Agentic RL with Smaller LLMs

Paper • 2509.25779 • Published Sep 30, 2025 • 19

liked 4 datasets 6 months ago

anisha2102/RaR-Science-20k-o3-mini

Viewer • Updated Oct 5, 2025 • 22.9k • 19 • 4

anisha2102/RaR-Medicine-20k-o3-mini

Viewer • Updated Oct 5, 2025 • 22.4k • 48 • 5

LLM360/guru-RL-92k

Viewer • Updated Aug 20, 2025 • 91.9k • 1.53k • 45

a-m-team/AM-Thinking-v1-Distilled

Preview • Updated Jun 12, 2025 • 943 • 58

liked a model 10 months ago

Wan-AI/Wan2.1-FLF2V-14B-720P

Updated Apr 17, 2025 • 1.44k • 228

upvoted a collection over 1 year ago

Tulu 3 Datasets

All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated 19 days ago • 96

liked 2 datasets about 2 years ago

google-research-datasets/mbpp

Viewer • Updated Jan 4, 2024 • 1.4k • 288k • 222

openai/openai_humaneval

Viewer • Updated Jan 4, 2024 • 164 • 229k • 373