ScaleML Lab

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

Modern LLM RL is quietly off-policy far more than we admit. Adaptive Layerwise Perturbation (ALP) mitigates instability with one unified recipe: a single importance ratio where only the updated training policy is perturbed via learnable layerwise hidden-state perturbations, shrinking tail mismatch and smoothing sharp objectives.

Feb 12, 2026
Chenlu Ye*, Xuanchang Zhang*, Yifan Hao*, et al.
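The teaser compresses the mechanism quite a bit. As a rough illustration only, here is a minimal PyTorch sketch of what a learnable layerwise perturbation and a clipped single importance ratio could look like; the class and function names, the additive per-layer deltas, and the clipping scheme are our assumptions for exposition, not ALP's actual implementation.

```python
# Sketch only: illustrates "perturb only the training policy, keep one
# importance ratio", under assumed shapes and names. Not the ALP codebase.
import torch
import torch.nn as nn

class LayerwisePerturbation(nn.Module):
    """Learnable additive perturbation, one vector per transformer layer
    (hypothetical parameterization)."""
    def __init__(self, num_layers: int, hidden_size: int, init_scale: float = 1e-3):
        super().__init__()
        self.deltas = nn.ParameterList(
            [nn.Parameter(init_scale * torch.randn(hidden_size))
             for _ in range(num_layers)]
        )

    def forward(self, hidden_states: list[torch.Tensor]) -> list[torch.Tensor]:
        # Broadcast each layer's delta over (batch, seq_len, hidden_size).
        return [h + d for h, d in zip(hidden_states, self.deltas)]

def single_importance_ratio(logp_train_perturbed: torch.Tensor,
                            logp_behavior: torch.Tensor,
                            clip_eps: float = 0.2) -> torch.Tensor:
    """One ratio per token: the perturbed training policy in the numerator,
    the fixed behavior policy that generated the rollouts in the denominator.
    Clipping is one assumed way to shrink the heavy tails of the mismatch."""
    ratio = torch.exp(logp_train_perturbed - logp_behavior)
    return torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
```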

Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead

Inspired by the success of DeepSeek-R1-Zero and several replications of PPO training with a rule-based reward, which achieve superior performance on mathematical reasoning and exhibit the emergence of an "Aha moment" during RL training, we are curious whether alternative algorithms from the RLHF literature can succeed under this framework.

Feb 16, 2025
Hanning Zhang, et al.
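Concretely, one RLHF-literature alternative the title points to is online DPO driven by a rule-based verifier. Below is a minimal sketch under our own assumptions (the function names and pairing scheme are hypothetical, not the post's code): verifier gradings form chosen/rejected pairs, which feed the standard DPO loss of Rafailov et al. (2023).

```python
# Sketch only: online DPO with a rule-based reward. Pairing scheme and
# names are assumptions; only the DPO loss itself is the standard objective.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective on sequence-level log-probabilities
    under the training policy and a frozen reference policy."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def make_pairs(responses: list[str], is_correct: list[bool]) -> list[tuple[str, str]]:
    """Rule-based pairing (assumed): for one prompt, every (correct, incorrect)
    pair of sampled responses becomes a (chosen, rejected) preference pair."""
    correct = [r for r, ok in zip(responses, is_correct) if ok]
    wrong = [r for r, ok in zip(responses, is_correct) if not ok]
    return [(c, w) for c in correct for w in wrong]
```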