0 - Introduction

0.1 - Supervised Fine-Tuning

Supervised Fine-Tuning (SFT) [1] is a process in which a pre-trained model is further trained on a specific dataset of labeled examples. This allows the model to adapt to specific tasks or domains that were not covered in the original training data. Specifically, the outputs, or target annotations, in the fine-tuning dataset typically come from the following sources:

  1. Human annotators, who write reference responses for each prompt.
  2. Other (typically larger or stronger) language models, whose outputs serve as target responses.
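
To make the SFT objective concrete, here is a minimal, illustrative sketch, assuming a HuggingFace-style causal language model; the base model name ("gpt2"), the toy dataset, and the hyperparameters are placeholder assumptions rather than part of the original description. Prompt tokens are masked out of the loss so that only response tokens are learned.

```python
# Minimal SFT sketch (illustrative only): fine-tune a causal LM on
# (prompt, response) pairs, computing the loss on response tokens only.
# Model name, data, and hyperparameters are placeholder assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = AdamW(model.parameters(), lr=1e-5)

sft_pairs = [  # labeled examples from human annotators or another LLM
    ("Translate to French: Hello.", "Bonjour."),
]

model.train()
for prompt, response in sft_pairs:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + " " + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # ignore prompt tokens in the loss
    loss = model(input_ids=full_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```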

0.2 - Rejection Sampling Fine-Tuning

However, Askell et al. [6] propose that:

<aside> 💡

SFT alters the model’s expectations for the underlying data distribution.

</aside>

Accordingly, Askell et al. [6] propose prompting and context distillation methods as alternatives to SFT. In addition, recent works [7, 8, 9] have adopted the Rejection Sampling Fine-Tuning (RFT) method to address this issue. Specifically, RFT is designed to generate more accurate reasoning paths using an SFT model. For each problem, the SFT model generates multiple candidate responses. The best candidates are selected based on criteria such as correctness or logical consistency, evaluated either with a designated scoring function (e.g., a reward model or a verifier) or through manual inspection. The highest-scoring candidates are retained, and these high-quality model-generated examples are then used to fine-tune the model from scratch. Iterating this process further improves the model's performance.
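
The sampling-and-filtering loop at the core of RFT could be sketched roughly as follows. This is not the exact procedure of [7, 8, 9]; the `score_candidate` function is a hypothetical stand-in for a reward model, verifier, or manual inspection, and the model name and generation settings are placeholder assumptions.

```python
# Illustrative RFT data-collection loop: sample several candidates per prompt
# from the SFT model, keep the highest-scoring one, and reuse the kept pairs
# for another round of fine-tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: in practice this would be the SFT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def score_candidate(prompt: str, candidate: str) -> float:
    """Hypothetical scoring function (reward model, verifier, or manual check)."""
    return float(len(candidate.strip()) > 0)  # trivial placeholder criterion

def collect_rft_pairs(prompts, num_samples=8, max_new_tokens=64):
    kept = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_new_tokens,
            num_return_sequences=num_samples,
            pad_token_id=tokenizer.eos_token_id,
        )
        # Strip the prompt tokens and keep only the generated continuations.
        candidates = [
            tokenizer.decode(o[inputs.input_ids.shape[1]:], skip_special_tokens=True)
            for o in outputs
        ]
        best = max(candidates, key=lambda c: score_candidate(prompt, c))
        kept.append((prompt, best))
    return kept  # fed back into fine-tuning (e.g., the SFT loop sketched above)
```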

However, the RFT method has one main weakness:

<aside> 💡

RFT requires sampling multiple candidate responses from an SFT model, which consumes significant effort and time.

</aside>

2 - Pre-Supervised Fine-Tuning for Selecting Suitable Responses in SFT Datasets

To address the highlighted weakness, we propose a combined approach that leverages both traditional SFT and RFT. Our goal is to achieve the performance benefits of RFT while minimizing the number of samples typically required by RFT.

<aside> 💡

We note that not all responses from human annotators or other large language models significantly influence the model’s underlying data distribution. Specifically, if the model can easily “learn” a response for a given prompt within a few SFT epochs, that prompt-response pair does not significantly alter the model’s expectations for the data distribution. Accordingly, the core idea of our method is to identify the samples in the SFT dataset that do not influence the model’s underlying data distribution.

</aside>

Here’s our new method:

  1. Pre-SFT Stage: We first train the model on the SFT dataset for a few epochs. To determine whether the model can easily learn a particular prompt-response pair, we compute the BLEU score between the response in the dataset and the output produced by the Pre-SFT model for the same prompt.
  2. Evaluation and Adjustment: If the model does not easily learn a prompt-response pair (indicated by a low BLEU score), we employ RFT to obtain a model-generated response that replaces the original response for that prompt (see the sketch after this list).
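
Below is a minimal sketch of the two steps above. It reuses the `tokenizer`, `model`, and `collect_rft_pairs` names from the earlier sketches, and the BLEU threshold, smoothing choice, and generation settings are illustrative assumptions rather than part of the method's specification.

```python
# Sketch of the proposed selection step (assumptions: `model` is the Pre-SFT
# model after a few SFT epochs, `bleu_threshold` is a tunable choice, and
# `collect_rft_pairs` is the illustrative RFT loop sketched earlier).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu(reference: str, hypothesis: str) -> float:
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

def refine_sft_dataset(sft_pairs, bleu_threshold=0.3, max_new_tokens=64):
    refined = []
    for prompt, response in sft_pairs:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, do_sample=False,
                                max_new_tokens=max_new_tokens,
                                pad_token_id=tokenizer.eos_token_id)
        generated = tokenizer.decode(output[0][inputs.input_ids.shape[1]:],
                                     skip_special_tokens=True)
        if bleu(response, generated) >= bleu_threshold:
            # Easily learned pair: it barely shifts the model's data
            # distribution, so keep the original response.
            refined.append((prompt, response))
        else:
            # Hard-to-learn pair: replace the response with a
            # model-generated one obtained via RFT.
            refined.extend(collect_rft_pairs([prompt]))
    return refined
```

In this sketch, easily learned pairs are kept unchanged, so RFT sampling is spent only on the hard pairs, which is where the savings over applying RFT to every prompt come from.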