0 - Introduction

Reinforcement Learning from Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models. In response to the complexity of the classical RLHF pipeline, direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative. Although DPO optimizes the same objective as the standard RLHF setup, there is a mismatch between the two approaches in practice [1].

In this blog, we explore the primary distinctions between Direct Preference Optimization (DPO) [2] and Reinforcement Learning from Human Feedback (RLHF), focusing specifically on Proximal Policy Optimization (PPO) [3], from a reinforcement learning (RL) perspective. In particular, we highlight the following contrasts:

- Advantage estimation: PPO relies on Generalized Advantage Estimation (GAE), whereas DPO performs no explicit advantage estimation.
- Critic model: PPO trains a separate value (critic) model; DPO does not.
- Sampling: PPO learns from on-policy rollouts, whereas DPO is trained on a fixed, off-policy preference dataset.

In summary, the lack of GAE-based advantage estimation, the absence of a critic model, and the use of off-policy sampling in DPO result in high-variance but unbiased token-wise reward estimates. This leads to a significant drawback for DPO: sample inefficiency. In the following sections, we outline the detailed differences between DPO and PPO and use a series of experiments to uncover a limitation of the DPO algorithm: it struggles to distinguish response pairs with substantial token overlap while still attempting to maximize the difference between them, which may reduce the likelihoods of both the positive and negative samples.

1 - Contrasting DPO and PPO from an RL Perspective

In this section, we will introduce the DPO and PPO algorithms, demonstrate their mathematical equivalence, and explore their differences. We aim to provide readers with a comprehensive comparison of DPO and PPO from a traditional RL perspective.

1.1 - Preliminaries

In the traditional RLHF setup [14][15], the Bradley-Terry model is used to represent the preference probability $p(y \succ y^{'}|x)$ as a sigmoid function of the difference between rewards:

$$ p(y\succ y^{'}|x)=\sigma(r(x,y)-r(x,y^{'}))\ \ \ (1) $$

where $\sigma$ denotes the sigmoid function, which maps the reward difference to a probability in $[0, 1]$. Given an empirical pairwise human preference dataset $D = \{(x_i, y_i \succ y_i^{'})\}$, the reward function can be inferred by minimizing the logistic regression (negative log-likelihood) loss:

$$ L(r) = -E_{(x,y,y^{'})\sim D}[\log(p(y \succ y^{'}|x))] \ \ \ (2) $$

$$ L(r) = -E_{(x,y,y^{'})\sim D}[\log(\sigma(r(x, y) - r(x, y^{'})))] \ \ \ (3) $$
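As a concrete illustration, here is a minimal PyTorch sketch of Eq. (1) and Eq. (3). It assumes the scalar rewards for the preferred and dispreferred responses (named `r_chosen` and `r_rejected` here purely for illustration) have already been produced by some reward model:

```python
import torch
import torch.nn.functional as F

def preference_probability(r_chosen: torch.Tensor,
                           r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry model, Eq. (1): p(y > y' | x) = sigmoid(r(x, y) - r(x, y'))."""
    return torch.sigmoid(r_chosen - r_rejected)

def reward_loss(r_chosen: torch.Tensor,
                r_rejected: torch.Tensor) -> torch.Tensor:
    """Logistic regression loss, Eq. (3): -E[log sigmoid(r(x, y) - r(x, y'))]."""
    # logsigmoid is numerically stabler than torch.log(torch.sigmoid(...)).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of scalar rewards for three preference pairs drawn from D.
r_chosen = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_rejected = torch.tensor([0.4, 0.9, 1.5])

print(preference_probability(r_chosen, r_rejected))  # per-pair p(y > y' | x)
loss = reward_loss(r_chosen, r_rejected)
loss.backward()  # in practice, gradients flow back into the reward model
```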