Supervised Fine-Tuning (SFT) [1] is a process where a pre-trained model is further trained on a specific dataset with labeled examples. This allows the model to adapt to more specific tasks or domains that were not included in the original training data. Specifically, the outputs, or target annotations, in the fine-tuning dataset typically come from the following sources:
However, Askell et al. [6] propose that:
<aside> 💡
SFT alters the model’s expectations for the underlying data distribution.
</aside>
Accordingly, Askell et al. [6] propose prompting and context distillation methods to replace the SFT approach. Additionally, recent works [7, 8, 9] have adopted the Rejection Sampling Fine-Tuning (RFT) method to address this issue. Specifically, RFT is designed to generate more accurate reasoning paths using an SFT model. For each problem, the SFT model generates multiple candidate responses. The best candidates are selected based on criteria such as correctness or logical consistency, which are often evaluated using a designated scoring function (e.g., a reward model or a verifier) or through manual inspection. The highest-scoring candidates are retained, and these high-quality model-generated examples are then used to fine-tune the model from scratch. This iterative process enhances its performance.
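The sample-score-select loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in [7, 8, 9]: the function name `rejection_sample`, its parameters, and the `generate` and `score` callables are all hypothetical stand-ins for an SFT model's sampling routine and a reward model or verifier.

```python
from typing import Callable, Dict, List


def rejection_sample(
    prompts: List[str],
    generate: Callable[[str], str],      # hypothetical: samples one response from the SFT model
    score: Callable[[str, str], float],  # hypothetical: reward model or verifier
    num_candidates: int = 4,
    keep_top: int = 1,
) -> List[Dict[str, str]]:
    """Collect the highest-scoring sampled responses for each prompt."""
    dataset = []
    for prompt in prompts:
        # Sample several candidate responses for the same prompt.
        candidates = [generate(prompt) for _ in range(num_candidates)]
        # Drop exact duplicates while preserving sampling order.
        unique = list(dict.fromkeys(candidates))
        # Rank candidates by score, best first (ties keep sampling order).
        ranked = sorted(unique, key=lambda r: score(prompt, r), reverse=True)
        # Retain only the top candidates as new fine-tuning examples.
        for response in ranked[:keep_top]:
            dataset.append({"prompt": prompt, "response": response})
    return dataset


if __name__ == "__main__":
    import random

    # Toy stand-ins: a noisy "model" and an exact-match "verifier".
    def toy_generate(prompt: str) -> str:
        return str(5 + random.choice([-1, 0, 0, 1]))

    def toy_score(prompt: str, response: str) -> float:
        return 1.0 if response == "5" else 0.0

    print(rejection_sample(["2+3"], toy_generate, toy_score))
```

The returned prompt-response pairs would then be passed to an ordinary SFT step. Note that the sampling cost grows linearly with `num_candidates`, since every prompt requires that many generations before any selection happens.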
However, the RFT method has one main weakness:
<aside> 💡
RFT requires sampling multiple candidate responses from an SFT model, which consumes significant effort and time.
</aside>
To address the highlighted weakness, we propose a combined approach that leverages both traditional SFT and RFT. Our goal is to achieve the performance benefits of RFT while minimizing the number of samples typically required by RFT.
<aside> 💡
We note that not all responses from human annotators or other large language models significantly influence the model’s underlying data distribution. Specifically, if the model can easily “learn” a response given a prompt during a few SFT epochs, the prompt-response pair does not significantly alter the model’s expectations for the data distribution. Accordingly, the core idea of our method is to identify the samples in the SFT dataset that do not influence the model’s underlying data distribution.
</aside>
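To make the notion of an "easily learned" sample concrete, here is one hypothetical way to operationalize it. The sketch assumes that a per-sample loss has already been measured after a few SFT epochs and that a fixed threshold separates easy from hard samples. Both the loss proxy and the threshold are assumptions made for illustration; the text above does not specify the criterion.

```python
from typing import Dict, List, Tuple


def split_by_learnability(
    per_sample_loss: Dict[str, float],  # assumed: sample id -> loss after a few SFT epochs
    threshold: float,                   # assumed: cutoff below which a sample counts as "easy"
) -> Tuple[List[str], List[str]]:
    """Partition sample ids into easily learned and hard-to-learn groups."""
    easy: List[str] = []
    hard: List[str] = []
    for sample_id, loss in per_sample_loss.items():
        # A low loss after only a few epochs suggests the response already
        # fits the model's expectations for the data distribution.
        if loss <= threshold:
            easy.append(sample_id)
        else:
            hard.append(sample_id)
    return easy, hard
```

Under this reading, only the hard group would be a candidate for the more expensive sampling procedure, which is where the saving over plain RFT would come from.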
Here’s our new method: