Publications
Our research contributions to the scientific community
2025
Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
Zihan Wang, Rui Pan, Jiarui Yao, Róbert Csordás, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
arXiv preprint arXiv:2506.18945, 2025
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
NeurIPS 2025, 2025
ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan et al.
arXiv preprint arXiv:2510.12693, 2025
FANS: Formal Answer Selection for Natural Language Math Reasoning Using Lean4
Jiarui Yao, Ruida Wang, Tong Zhang
EMNLP 2025, 2025
GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang
arXiv preprint arXiv:2510.11769, 2025
MiCRo: Mixture Modeling and Context-aware Routing for Personalized Preference Learning
Jingyan Shen, Jiarui Yao, Rui Yang, Yifan Sun, Feng Luo, Rui Pan, Tong Zhang, Han Zhao
EMNLP 2025, 2025
A Minimalist Approach to LLM Reasoning: From Rejection Sampling to Reinforcement Learning
Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong et al.
arXiv preprint arXiv:2504.11343, 2025
2024
Empirical Studies on the Limitations of Direct Preference Optimization, and a Possible Quick Fix
Jiarui Yao, Yong Lin, Tong Zhang
2nd Workshop on Models of Human Feedback for AI Alignment, 2024
RLHF Workflow: From Reward Modeling to Online RLHF
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
Transactions on Machine Learning Research (TMLR), 2024
In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects remain largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easy-to-reproduce recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models from a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available.
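The abstract above describes the online iterative loop only at a high level. The sketch below is a minimal, hypothetical Python illustration of that loop, assuming a proxy preference (reward) model trained on open-source data stands in for human feedback; the function names (generate_candidates, score_with_proxy_rm, dpo_update) are illustrative placeholders and not the authors' released code.

```python
# Minimal sketch of an online iterative RLHF loop with a proxy preference model.
# All components here are toy placeholders for illustration only.
import random


def generate_candidates(policy, prompt, k=2):
    # Placeholder: a real policy would sample k responses from the current LLM.
    return [f"{prompt} :: response {i} (policy v{policy['version']})" for i in range(k)]


def score_with_proxy_rm(prompt, response):
    # Placeholder proxy preference model: in practice, a reward model trained on
    # open-source preference datasets approximates human feedback.
    return random.random()


def dpo_update(policy, preference_pairs):
    # Placeholder policy update: a real implementation would run a direct
    # preference optimization (or similar) step on the chosen/rejected pairs.
    policy["version"] += 1
    return policy


def online_iterative_rlhf(prompts, num_iterations=3):
    policy = {"version": 0}  # stand-in for an SFT-initialized LLM
    for _ in range(num_iterations):
        preference_pairs = []
        for prompt in prompts:
            candidates = generate_candidates(policy, prompt)
            ranked = sorted(
                candidates,
                key=lambda r: score_with_proxy_rm(prompt, r),
                reverse=True,
            )
            # Best vs. worst response under the proxy preference model.
            preference_pairs.append((prompt, ranked[0], ranked[-1]))
        # On-policy preference data collected this round drives the update.
        policy = dpo_update(policy, preference_pairs)
    return policy


if __name__ == "__main__":
    final_policy = online_iterative_rlhf(["Explain RLHF in one sentence."])
    print(final_policy)
```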