Exploration-Driven Policy Optimization in RLHF: Theoretical Insights on Efficient Data Utilization
Abstract
Reinforcement Learning from Human Feedback (RLHF) has achieved impressive empirical success while relying on a small amount of human feedback. However, there is limited theoretical justification for this phenomenon. Additionally, most recent theoretical studies focus on value-based algorithms despite the empirical successes of policy-based algorithms. In this work, we consider an RLHF algorithm based on policy optimization (PO-RLHF). The algorithm builds on the popular Policy Cover-Policy Gradient (PC-PG) algorithm, which assumes knowledge of the reward function. In PO-RLHF, knowledge of the reward function is not assumed; instead, the algorithm uses trajectory-based comparison feedback to infer the reward function. We provide performance bounds for PO-RLHF with low query complexity, which offer insight into why a small amount of human feedback may be sufficient to achieve good performance with RLHF. We propose and analyze the algorithms PG-RLHF and NN-PG-RLHF for two important settings: linear and neural function approximation, respectively.
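To make the reward-inference step concrete, below is a minimal sketch of how a linear reward model might be fit from trajectory-based comparison feedback. The abstract does not specify the feedback model; the sketch assumes the standard Bradley-Terry comparison model with linear reward features, a common assumption in RLHF theory. The function names (trajectory_features, fit_reward_from_comparisons) and the plain gradient-ascent fit are illustrative and are not the paper's actual estimator.

import numpy as np

def trajectory_features(traj, feature_fn):
    # Cumulative per-step features phi(s, a) along a trajectory (linear reward setting).
    return np.sum([feature_fn(s, a) for (s, a) in traj], axis=0)

def fit_reward_from_comparisons(feat_pairs, labels, dim, lr=0.1, n_iters=500):
    # Estimate a linear reward parameter theta from pairwise trajectory comparisons,
    # assuming the Bradley-Terry model P(a preferred over b) = sigmoid(theta^T (phi_a - phi_b)).
    # feat_pairs: list of (phi_a, phi_b) cumulative-feature vectors; labels: 1 if a preferred, else 0.
    theta = np.zeros(dim)
    diffs = np.array([pa - pb for (pa, pb) in feat_pairs])  # shape (n, dim)
    y = np.asarray(labels, dtype=float)                     # shape (n,)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-diffs @ theta))            # predicted P(a preferred over b)
        grad = diffs.T @ (y - p) / len(y)                    # gradient of the log-likelihood
        theta += lr * grad                                   # gradient ascent on the likelihood
    return theta

# Example: a single comparison between two trajectories with 3-dimensional features.
# phi_a = trajectory_features(traj_a, feature_fn); phi_b = trajectory_features(traj_b, feature_fn)
# theta_hat = fit_reward_from_comparisons([(phi_a, phi_b)], [1], dim=3)

The estimated reward parameter would then feed into a PC-PG-style policy optimization loop with exploration, which this sketch omits.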
Bio
Dr. Yihan Du is currently a postdoctoral researcher at the University of Illinois Urbana-Champaign, working with Prof. R. Srikant. Her research interests lie in machine learning, with an emphasis on reinforcement learning and online learning. Dr. Du obtained her Ph.D. from the Institute for Interdisciplinary Information Sciences (headed by Prof. Andrew Chi-Chih Yao) at Tsinghua University in 2023. She has published several papers in top machine learning conferences, including ICML, NeurIPS, ICLR, and AAAI. Dr. Du has also received several honors, including the China Computer Federation (CCF) Agent and Multi-Agent System Doctoral Dissertation Award and the Tsinghua Outstanding Doctoral Dissertation Award.
Event Contact: Iam-Choon Khoo