Publications

Our research contributions to the scientific community

2025

Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models

Zihan Wang, Rui Pan, Jiarui Yao, Róbert Csordás, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu

arXiv preprint arXiv:2506.18945, 2025

Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang

NeurIPS 2025, 2025

ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning

Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan et al.

arXiv preprint arXiv:2510.12693, 2025

FANS: Formal Answer Selection for Natural Language Math Reasoning Using Lean4

Jiarui Yao, Ruida Wang, Tong Zhang

EMNLP 2025, 2025

GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving

Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang

arXiv preprint arXiv:2510.11769, 2025

MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning

Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, Han Zhao

EMNLP 2025, 2025

A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforcement Learning

Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong et al.

arXiv preprint arXiv:2504.11343, 2025

2024

Empirical Studies on the Limitations of Direct Preference Optimization, and a Possible Quick Fix

Jiarui Yao, Yong Lin, Tong Zhang

2nd Workshop on Models of Human Feedback for AI Alignment, 2024

Featured

RLHF Workflow: From Reward Modeling to Online RLHF

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang

Transactions on Machine Learning Research (TMLR), 2024

In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the resulting proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets, and we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.

PDF
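
To make the featured workflow concrete, below is a minimal, self-contained sketch of the online iterative loop it describes: sample candidate responses from the current policy, rank them with a proxy reward model that stands in for human feedback, keep best-vs-worst preference pairs, and run a preference-optimization update before the next round. The policy, reward scorer, and update step here are toy placeholders chosen for illustration, not the paper's implementation, which trains the preference model on open-source datasets and updates an LLM with a DPO-style objective.

```python
# Minimal sketch of an online iterative RLHF loop, using toy stand-ins for the
# policy, the proxy reward model, and the preference-optimization step.
import random

random.seed(0)

PROMPTS = ["prompt A", "prompt B", "prompt C"]


def generate(policy_temperature: float, prompt: str, n: int = 8) -> list[str]:
    """Toy stand-in for sampling n candidate responses from the current policy."""
    return [f"{prompt} / response {i} (temp={policy_temperature:.2f})" for i in range(n)]


def proxy_reward(response: str) -> float:
    """Toy stand-in for the proxy preference model that approximates human feedback."""
    return random.random() + 0.01 * len(response)


def dpo_style_update(policy_temperature: float, pairs: list[tuple[str, str]]) -> float:
    """Toy stand-in for a preference-optimization step on (chosen, rejected) pairs."""
    # Pretend the policy improves slightly for every pair it is trained on.
    return max(0.1, policy_temperature - 0.01 * len(pairs))


policy_temperature = 1.0  # toy proxy for "the current policy"
for iteration in range(3):
    pairs = []
    for prompt in PROMPTS:
        candidates = generate(policy_temperature, prompt)
        ranked = sorted(candidates, key=proxy_reward, reverse=True)
        # Best-vs-worst pair construction against the proxy reward model.
        pairs.append((ranked[0], ranked[-1]))
    policy_temperature = dpo_style_update(policy_temperature, pairs)
    print(f"iteration {iteration}: trained on {len(pairs)} preference pairs")
```

In practice, each round replaces these stand-ins with real components (an SFT'd LLM for generation, a trained reward/preference model for scoring, and a DPO-style trainer for the update), which is the structure the report's recipe walks through step by step.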