VoladorLuYu's Collections: Super Alignment
- Trusted Source Alignment in Large Language Models (arXiv:2311.06697)
- Diffusion Model Alignment Using Direct Preference Optimization (arXiv:2311.12908)
- SuperHF: Supervised Iterative Learning from Human Feedback (arXiv:2310.16763)
- Enhancing Diffusion Models with Text-Encoder Reinforcement Learning (arXiv:2311.15657)
- Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model (arXiv:2311.13231)
- Aligning Text-to-Image Diffusion Models with Reward Backpropagation (arXiv:2310.03739)
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback (arXiv:2309.00267)
- Aligning Language Models with Offline Reinforcement Learning from Human Feedback (arXiv:2308.12050)
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions (arXiv:2309.10150)
- Secrets of RLHF in Large Language Models Part I: PPO (arXiv:2307.04964)
- Efficient RLHF: Reducing the Memory Usage of PPO (arXiv:2309.00754)
- Aligning Large Multimodal Models with Factually Augmented RLHF (arXiv:2309.14525)
- Nash Learning from Human Feedback (arXiv:2312.00886)
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback (arXiv:2312.00849)
- Training Chain-of-Thought via Latent-Variable Inference (arXiv:2312.02179)
- Reinforcement Learning from Diffusion Feedback: Q* for Image Search (arXiv:2311.15648)
- OneLLM: One Framework to Align All Modalities with Language (arXiv:2312.03700)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (arXiv:2204.05862)
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment (arXiv:2403.05135)
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences (arXiv:2404.03715)
- Dataset Reset Policy Optimization for RLHF (arXiv:2404.08495)
- Learn Your Reference Model for Real Good Alignment (arXiv:2404.09656)
- RLHF Workflow: From Reward Modeling to Online RLHF (arXiv:2405.07863)
- Iterative Reasoning Preference Optimization (arXiv:2404.19733)
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level (arXiv:2406.11817)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
- Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (arXiv:2312.08935)
- Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782)
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs (arXiv:2410.18451)