ScaleML Lab @ UIUC

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Modern LLM RL is quietly off-policy far more than we admit. Adaptive Layerwise Perturbation (ALP) mitigates instability with one unified recipe: a single importance ratio where only the updated training policy is perturbed via learnable layerwise hidden-state perturbations, shrinking tail mismatch and smoothing sharp objectives.

Feb 12, 2026

Chenlu Ye*, Xuanchang Zhang*, Yifan Hao*, et al.

Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead

Inspired by the success of DeepSeek-R1-Zero and several replications of PPO training with rule-based rewards, which achieve superior performance on mathematical reasoning and show the emergence of the “Aha moment” during RL training, we are curious about alternative algorithms developed in the RLHF literature under this framework.

Feb 16, 2025

Hanning Zhang, et al.