
0 - Introduction

Reinforcement Learning from Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complexity of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach. Although DPO solves the same objective as the standard RLHF setup, there is a mismatch between the two approaches [1].

In this blog, we explore the primary distinctions between Direct Preference Optimization (DPO) [2] and Reinforcement Learning from Human Feedback (RLHF), focusing specifically on Proximal Policy Optimization (PPO) [3], from a reinforcement learning (RL) perspective. We highlight the contrasts between the two approaches in the sections that follow.

1 - Contrasting DPO and PPO from an RL Perspective

In this section, we will introduce the DPO and PPO algorithms, demonstrate their mathematical equivalence, and explore their differences. We aim to provide readers with a comprehensive comparison of DPO and PPO from a traditional RL perspective.

1.1 - Preliminaries

In the traditional RLHF setup [14][15], the Bradley-Terry model is used to represent the preference probability $p(y \succ y^{'}|x)$ as a sigmoid of the difference between rewards:

$$ p(y\succ y^{'}|x)=\sigma(r(x,y)-r(x,y^{'}))\ \ \ (1) $$

where $\sigma$ denotes the sigmoid function, which serves as a normalization mechanism. Given an empirical pair-wise human preference dataset $D = \{(x_i, y_i \succ y_i^{'})\}_{i=1}^{N}$, the reward function can be inferred by minimizing the logistic regression loss:

$$ L(r) = -E_{(x,y,y^{'})\sim D}[\log(p(y \succ y^{'}|x))] \ \ \ (2) $$

$$ L(r) = -E_{(x,y,y^{'})\sim D}[\log(\sigma(r(x, y) - r(x, y^{'})))] \ \ \ (3) $$
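To make equations (1)-(3) concrete, here is a minimal PyTorch-style sketch of the Bradley-Terry preference probability and the resulting reward-model loss. The tensors `r_chosen` and `r_rejected` are hypothetical stand-ins for the outputs of a reward model $r(x, y)$ on the preferred and dispreferred responses; in practice they would come from a learned scalar head on a language model.

```python
import torch
import torch.nn.functional as F

def bradley_terry_prob(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Eq. (1): p(y > y' | x) = sigmoid(r(x, y) - r(x, y'))."""
    return torch.sigmoid(r_chosen - r_rejected)

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Eqs. (2)-(3): negative log-likelihood of the Bradley-Terry model,
    averaged over the preference pairs in a batch."""
    # -log(sigmoid(r_chosen - r_rejected)), computed stably via logsigmoid
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: reward scores for a batch of (chosen, rejected) completions.
# These numbers are made up for illustration only.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.0])
print(bradley_terry_prob(r_chosen, r_rejected))  # per-pair preference probabilities
print(reward_model_loss(r_chosen, r_rejected))   # scalar training loss
```

Minimizing this loss over the preference dataset $D$ is exactly the logistic regression of equation (3), which yields the reward function used in the policy-optimization stage below.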

With the learned reward function $r(x, y)$, the objective of RLHF is to optimize the policy $\pi$ to maximize the expected reward while keeping $\pi$ close to some reference policy $\pi_{ref}$, through the following KL-regularized objective function: