Shangmin Guo@University of Edinburgh | [email protected]

Wei Xiong@University of Illinois Urbana-Champaign | [email protected]

Authors are listed in alphabetical order.

Thanks to Hanze Dong@Salesforce, Tianqi Liu@Google, Wei Shen@Ernie code team, and Haoxiang Wang@UIUC for insightful feedback on an early draft of this blog.

Date: Mar 26, 2024

To readers:

TL;DR:

Reinforcement learning from human feedback (RLHF) is a leading technique for adapting the outputs of generative models to human preferences, and it has achieved tremendous success in ChatGPT by OpenAI, Claude by Anthropic, and Gemini by Google. Inspired by these successes, preference optimization (a slightly more general term that also covers RL-free algorithms) has attracted significant attention over the past year. In this blog, we aim to present a comprehensive introduction to the frontier research in this exciting field, explore the ongoing challenges, and discuss interesting research problems for the future.

Table of Contents

  1. Prerequisites
    1. Alignment Objective
    2. Pre-training and Instruction-following Fine-tuning
    3. Preference Data Collection, Reward, and Bradley-Terry Model
    4. On/off-policy and On/off-line Learning in the Context of Alignment
  2. RLHF: The Classic Framework to Make ChatGPT
    1. InstructGPT: A Three-stage Approach
    2. Online Iterative RLHF
  3. RL-Free Framework: SLiC, DPO, IPO, and More
    1. Direct Preference Optimization (DPO) and Online Variants
    2. Identity Preference Optimization (IPO)
    3. Sequence Likelihood Calibration (SLiC)
    4. Comparison of DPO, IPO, and SLiC
    5. Rejection Sampling in RLHF
  4. Miscellaneous
    1. Reward Modeling in RLHF
    2. Evaluation in RLHF
    3. Theoretical Understanding of RLHF: Why Should We Choose Online RLHF/DPO?
    4. Alignment without External Preference Signals
  5. Beyond the Bradley-Terry Model
    1. Nash Learning: Dropping the Reward Model
    2. Multi-objective Learning and Human-preference-aware Alignment
    3. Pointwise Feedback: Kahneman-Tversky Optimization
  6. Other Research Directions and End Note