0 - Introduction

Reinforcement Learning from Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complexity of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative. Although DPO optimizes the same objective as the standard RLHF setup, there is a mismatch between the two approaches in practice [1].

In this blog, we explore the primary distinctions between Direct Preference Optimization (DPO) [2] and Reinforcement Learning from Human Feedback (RLHF), focusing specifically on Proximal Policy Optimization (PPO) [3], from a reinforcement learning (RL) perspective. In particular, we highlight the following contrasts:

- Advantage estimation: PPO relies on Generalized Advantage Estimation (GAE), whereas DPO performs no explicit advantage estimation.
- Critic model: PPO trains a separate value (critic) model; DPO does not.
- Sampling: PPO learns from on-policy rollouts, whereas DPO is trained on a fixed, off-policy preference dataset.

In summary, the lack of GAE-based advantage estimation, the absence of a critic model, and the use of off-policy sampling in DPO result in high-variance but unbiased token-wise reward estimates. This leads to a significant drawback for DPO: sample inefficiency. In the following sections, we outline the detailed differences between DPO and PPO and use a series of experiments to uncover a limitation of the DPO algorithm: it struggles to distinguish response pairs with substantial token overlap while still attempting to maximize the difference between them, which may reduce the likelihoods of both the positive and negative samples.

1 - Contrasting DPO and PPO from an RL Perspective

In this section, we will introduce the DPO and PPO algorithms, demonstrate their mathematical equivalence, and explore their differences. We aim to provide readers with a comprehensive comparison of DPO and PPO from a traditional RL perspective.

1.1 - Preliminaries

In the traditional RLHF setup [14][15], the Bradley-Terry model is used to represent the preference probability $p(y \succ y^{'}|x)$ as a sigmoid function of the difference between rewards:

$$ p(y\succ y^{'}|x)=\sigma(r(x,y)-r(x,y^{'}))\ \ \ (1) $$

where $\sigma$ denotes the sigmoid function, which maps the reward difference to a probability in $[0, 1]$. Given an empirical pairwise human preference dataset $D = \{(x_i, y_i \succ y_i^{'})\}$, the reward function can be inferred by minimizing the logistic regression (negative log-likelihood) loss:

$$ L(r) = -E_{(x,y,y^{'})\sim D}[\log(p(y \succ y^{'}|x))] \ \ \ (2) $$

$$ L(r) = -E_{(x,y,y^{'})\sim D}[\log(\sigma(r(x, y) - r(x, y^{'})))] \ \ \ (3) $$
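As a concrete illustration, here is a minimal PyTorch sketch of Eq. (1) and Eq. (3). It assumes the scalar rewards for the preferred and dispreferred responses (named `r_chosen` and `r_rejected` here purely for illustration) have already been produced by some reward model:

```python
import torch
import torch.nn.functional as F

def preference_probability(r_chosen: torch.Tensor,
                           r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry model, Eq. (1): p(y > y' | x) = sigmoid(r(x, y) - r(x, y'))."""
    return torch.sigmoid(r_chosen - r_rejected)

def reward_loss(r_chosen: torch.Tensor,
                r_rejected: torch.Tensor) -> torch.Tensor:
    """Logistic regression loss, Eq. (3): -E[log sigmoid(r(x, y) - r(x, y'))]."""
    # logsigmoid is numerically stabler than torch.log(torch.sigmoid(...)).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of scalar rewards for three preference pairs drawn from D.
r_chosen = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_rejected = torch.tensor([0.4, 0.9, 1.5])

print(preference_probability(r_chosen, r_rejected))  # per-pair p(y > y' | x)
loss = reward_loss(r_chosen, r_rejected)
loss.backward()  # in practice, gradients flow back into the reward model
```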