0 - Introduction

0.1 - Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) [1] is a process in which a pre-trained model is further trained on a specific dataset of labeled examples. This allows the model to adapt to specific tasks or domains that were not covered in the original training data. Specifically, the outputs, or target annotations, in the fine-tuning dataset typically come from the following sources:

  1. Human annotators, who write reference responses for each prompt.
  2. Other (typically larger or stronger) language models, whose outputs serve as target responses.
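
To make the SFT objective concrete, here is a minimal, illustrative sketch, assuming a HuggingFace-style causal language model; the base model name ("gpt2"), the toy dataset, and the hyperparameters are placeholder assumptions rather than part of the original description. Prompt tokens are masked out of the loss so that only response tokens are learned.

```python
# Minimal SFT sketch (illustrative only): fine-tune a causal LM on
# (prompt, response) pairs, computing the loss on response tokens only.
# Model name, data, and hyperparameters are placeholder assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

sft_pairs = [  # labeled examples from human annotators or another LLM
    ("Translate to French: Hello.", "Bonjour."),
]

model.train()
for prompt, response in sft_pairs:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```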

0.2 - Rejection Sampling Fine-Tuning

However, Askell et al. [6] propose that:

<aside> 💡

SFT alters the model’s expectations for the underlying data distribution.

</aside>

Accordingly, Askell et al. [6] propose prompting and context distillation methods as alternatives to SFT. In addition, recent works [7, 8, 9] have adopted the Rejection Sampling Fine-Tuning (RFT) method to address this issue. Specifically, RFT is designed to generate more accurate reasoning paths using an SFT model. For each problem, the SFT model generates multiple candidate responses. The best candidates are selected based on criteria such as correctness or logical consistency, evaluated either with a designated scoring function (e.g., a reward model or a verifier) or through manual inspection. The highest-scoring candidates are retained, and these high-quality model-generated examples are then used to fine-tune the model from scratch. Iterating this process further improves the model's performance.
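
The sampling-and-filtering loop at the core of RFT could be sketched roughly as follows. This is not the exact procedure of [7, 8, 9]; the `score_candidate` function is a hypothetical stand-in for a reward model, verifier, or manual inspection, and the model name and generation settings are placeholder assumptions.

```python
# Illustrative RFT data-collection loop: sample several candidates per prompt
# from the SFT model, keep the highest-scoring one, and reuse the kept pairs
# for another round of fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: in practice this would be the SFT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def score_candidate(prompt: str, candidate: str) -> float:
    """Hypothetical scoring function (reward model, verifier, or manual check)."""
    return float(len(candidate.strip()) > 0)  # trivial placeholder criterion

def collect_rft_pairs(prompts, num_samples=8, max_new_tokens=64):
    kept = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_new_tokens,
            num_return_sequences=num_samples,
            pad_token_id=tokenizer.eos_token_id,
        )
        # Strip the prompt tokens and keep only the generated continuations.
        candidates = [
            tokenizer.decode(o[inputs.input_ids.shape[1]:], skip_special_tokens=True)
            for o in outputs
        ]
        best = max(candidates, key=lambda c: score_candidate(prompt, c))
        kept.append((prompt, best))
    return kept  # fed back into fine-tuning (e.g., the SFT loop sketched above)
```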

However, the RFT method has one main weakness:

<aside> 💡

RFT requires sampling multiple candidate responses from an SFT model, which consumes significant effort and time.

</aside>

2 - Pre-Supervised Fine-Tuning for Selecting Suitable Responses in SFT Datasets

To address the highlighted weakness, we propose a combined approach that leverages both traditional SFT and RFT. Our goal is to achieve the performance benefits of RFT while minimizing the number of samples typically required by RFT.

<aside> 💡

We note that not all responses from human annotators or other large language models significantly influence the model’s underlying data distribution. Specifically, if the model can easily “learn” a response for a given prompt within a few SFT epochs, that prompt-response pair does not significantly alter the model’s expectations for the data distribution. Accordingly, the core idea of our method is to identify the samples in the SFT dataset that do not influence the model’s underlying data distribution.

</aside>

Here’s our new method:

  1. Pre-SFT Stage: We first train the model on the SFT dataset for a few epochs. To determine whether the model can easily learn a particular prompt-response pair, we compute the BLEU score between the response in the dataset and the output produced by the Pre-SFT model for the same prompt.
  2. Evaluation and Adjustment: If the model does not easily learn a prompt-response pair (indicated by a low BLEU score), we employ RFT to obtain a model-generated response that replaces the original response for that prompt (see the sketch after this list).
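
Below is a minimal sketch of the two steps above. It reuses the `tokenizer`, `model`, and `collect_rft_pairs` names from the earlier sketches, and the BLEU threshold, smoothing choice, and generation settings are illustrative assumptions rather than part of the method's specification.

```python
# Sketch of the proposed selection step (assumptions: `model` is the Pre-SFT
# model after a few SFT epochs, `bleu_threshold` is a tunable choice, and
# `collect_rft_pairs` is the illustrative RFT loop sketched earlier).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu(reference: str, hypothesis: str) -> float:
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

def refine_sft_dataset(sft_pairs, bleu_threshold=0.3, max_new_tokens=64):
    refined = []
    for prompt, response in sft_pairs:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, do_sample=False,
                                max_new_tokens=max_new_tokens,
                                pad_token_id=tokenizer.eos_token_id)
        generated = tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                                     skip_special_tokens=True)
        if bleu(response, generated) >= bleu_threshold:
            # Easily learned pair: it barely shifts the model's data
            # distribution, so keep the original response.
            refined.append((prompt, response))
        else:
            # Hard-to-learn pair: replace the response with a
            # model-generated one obtained via RFT.
            refined.extend(collect_rft_pairs([prompt]))
    return refined
```

In this sketch, easily learned pairs are kept unchanged, so RFT sampling is spent only on the hard pairs, which is where the savings over applying RFT to every prompt come from.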