Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
Modern LLM RL is quietly off-policy far more often than we admit. Adaptive Layerwise Perturbation (ALP) mitigates the resulting instability with one unified recipe: a single importance ratio in which only the updated training policy is perturbed, via learnable layerwise hidden-state perturbations, shrinking tail mismatch and smoothing sharp objectives.
Feb 12, 2026
Chenlu Ye*, Xuanchang Zhang*, Yifan Hao*, et al.
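To make the recipe in the abstract concrete, here is a toy sketch of the core idea: a single importance ratio where the behavior side is the unperturbed network and the training side adds small learnable per-layer perturbations to its hidden states. This is not the authors' implementation; the network, the additive form of the perturbation, and the clip bound are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer policy network (hypothetical stand-in for an LLM).
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))

def policy_logits(x, deltas=None):
    """Forward pass; optionally add a learnable perturbation to each
    layer's hidden state (the layerwise perturbation in the blurb)."""
    h = np.tanh(x @ W1)
    if deltas is not None:
        h = h + deltas[0]            # perturb layer-1 hidden state
    logits = h @ W2
    if deltas is not None:
        logits = logits + deltas[1]  # perturb final-layer output
    return logits

def log_prob(logits, a):
    """Log-probability of action a under a softmax over logits."""
    z = logits - logits.max()
    return z[a] - np.log(np.exp(z).sum())

# Behavior policy: the unperturbed network. Training policy: the same
# weights plus small learned deltas (initialized randomly here).
deltas = [0.01 * rng.normal(size=8), 0.01 * rng.normal(size=3)]
x, a = rng.normal(size=4), 1

logp_train = log_prob(policy_logits(x, deltas), a)
logp_behav = log_prob(policy_logits(x), a)

# One importance ratio, with only the training side perturbed; the
# clip shrinks tail mismatch (the bound of 2.0 is an assumption).
ratio = float(np.clip(np.exp(logp_train - logp_behav), 0.0, 2.0))
```

Because only the numerator policy carries the perturbations, the deltas can be trained to keep the ratio close to 1 on typical tokens, which is one plausible reading of "shrinking tail mismatch" in the abstract.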