Feb 16, 2025
Hanning Zhang, et al.
Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead
Inspired by the success of DeepSeek-R1-Zero and by several replications of PPO training with rule-based rewards, which achieve superior performance on mathematical reasoning and show the emergence of the “Aha moment” during RL training, we are curious about alternative algorithms developed in the RLHF literature under this framework.