Building on HF

Naman Vats PRO

namanvats

AI & ML interests

Make Open Source AI win

Recent Activity

reacted to anakin87's post with ❤️ about 2 hours ago
How does LLM training with RL environments work?

It all starts with Reinforcement Learning with Verifiable Rewards:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against the ground truth
- the reward drives RL training

In this setup, the environment is simple: fixed questions and answers, rollout logic, and reward(s).

Consider a more complex tic-tac-toe env ❌⭕. It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions (envs can also include tools)

What happens at training time? We use Group Relative Policy Optimization with a tic-tac-toe env. No critic model is needed; the group is the baseline, which makes it simpler than PPO.

1️⃣ Rollout generation: from the same board, the model plays N games via sampling
2️⃣ Each game is scored with deterministic rewards (win, format, ...)
3️⃣ The mean score is computed across the group
4️⃣ Each rollout's advantage = its score minus the group mean
5️⃣ The model is updated to favor trajectories above the baseline
🔁 Repeat

For a deep dive, check out 🌱 https://github.com/anakin87/llm-rl-environments-lil-course, a free hands-on course on RL environments for LLMs.
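The advantage step (2️⃣–4️⃣) can be sketched in a few lines. This is a minimal illustration, not code from the linked course: the reward values and win/draw/loss scheme below are assumed for the example, and a real setup would generate the rewards from actual env rollouts.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's score minus the group mean.

    The group mean acts as the baseline, so no critic model is needed.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Example: N = 4 games played from the same board, scored deterministically.
# Assumed scheme: win = 1.0, draw = 0.5, loss = 0.0.
rewards = [1.0, 0.0, 1.0, 0.5]
advantages = group_relative_advantages(rewards)
# Rollouts above the group mean get a positive advantage and are reinforced;
# those below it get a negative advantage and are discouraged.
```

By construction the advantages sum to zero across the group, so the update only shifts probability mass toward the better-than-average trajectories.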

Organizations

Blog-explorers, Sentient Foundation