Thanks to Chenghao Yang, Chujie Zheng, Daya Guo, Chengpeng Li, and Pengyu Zhao for insightful feedback on an early draft of this blog.

TL;DR: In-Context Learning (ICL) is one of the emergent capabilities of LLMs. In this blog, we introduce the Bayesian inference view of ICL and examine the key elements that influence its effectiveness. We then explore how these elements can be leveraged to enhance both the pre-training and the alignment of chat-based LLMs, proposing new pathways to boost chat-based LLM performance.


0 - Introduction

In-context learning (ICL) [1, 2, 3, 4, 5] is one of the emergent capabilities of LLMs and was first described in the GPT-3 paper [6]. Building on GPT-3, OpenAI developed ChatGPT and GPT-4 [7, 8], which combine pre-training and alignment processes to produce chat-based models. Subsequently, numerous studies [9, 10, 11, 12] have investigated how to improve the performance of chat-based LLMs during both pre-training and alignment. Rather than focusing solely on ICL or on chatbot performance in isolation, this blog explores how ICL might improve chat-based model performance and examines the interplay between ICL, pre-training, and alignment.

In this blog, we summarize various perspectives on the mechanisms behind ICL and examine the pivotal elements that influence its effectiveness in Sections 1 and 2. We then explore how these pivotal elements can be leveraged to enhance both model pre-training and alignment [7], proposing new pathways to improve chat-based LLM performance. Specifically, in Section 3 we review In-Context Pre-training methods, which enhance pre-training by incorporating longer, more coherent, or structured documents into the pre-training dataset. In Section 4 we summarize In-Context Supervised Fine-Tuning (SFT) methods, which add related context from the pre-training dataset to the SFT datasets or use specialized data formats; these approaches bridge the gap between the pre-training and SFT phases, thereby making SFT more effective. Finally, we review papers that investigate the interplay between ICL, pre-training, and SFT, aiming to provide insights that could pave the way for future improvements in LLM performance.
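
To make the In-Context Pre-training idea concrete, here is a minimal toy sketch in Python: instead of filling a context window with randomly concatenated documents, related documents are greedily chained together so the model sees coherent long-range context. The `similarity` stand-in and the greedy `pack_related` packing are illustrative assumptions, not any specific paper's pipeline.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Cheap stand-in for the embedding similarity a real pipeline would use."""
    return SequenceMatcher(None, a, b).ratio()

def pack_related(docs: list[str], max_chars: int = 2000) -> list[str]:
    """Greedily chain each document with its most similar unused neighbor,
    then concatenate each chain into one long pre-training sequence."""
    remaining = docs[:]
    sequences = []
    while remaining:
        chain = [remaining.pop(0)]  # start a new chain with the next document
        while remaining and sum(map(len, chain)) < max_chars:
            nxt = max(remaining, key=lambda d: similarity(chain[-1], d))
            chain.append(nxt)
            remaining.remove(nxt)
        sequences.append("\n\n".join(chain))
    return sequences
```

The point is only that each training sequence stays topically coherent across document boundaries, which is the property these methods rely on.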

1 - In-Context Learning

In this section, we aim to answer the following questions:

- What is ICL, and how does the model learn a task from demonstrations at inference time?
- Which elements of the prompt influence the model's ICL performance during inference?
- Which aspects of pre-training affect the model's acquisition of ICL ability?

1.1 - Introduction

During ICL [1], we give the LM a prompt that demonstrates a task. This prompt typically contains a list of input-output pairs followed by an input awaiting prediction. The model learns from these input-output pairs and generates an answer for the new input entirely at inference time, without any parameter updates. As depicted in the figure below, the prompt is augmented with a series of input-output pairs, and the LM generates predictions by merely conditioning on that prompt and the input awaiting prediction. To respond accurately, the model must analyze the input-output pairs to infer the task from the input distribution (financial or general news), the output distribution (positive/negative sentiment or specific topics), and the mapping between inputs and outputs (for sentiment or topic classification, respectively). For complex tasks, it must also account for the order of the demonstrations, their number (e.g., three pairs in the example), and the data format (with "//" marking where the prediction goes). This blog posits that such elements of the prompt's phrasing are crucial in determining the effectiveness of ICL. We also explore which aspects of the pre-training phase might affect the model's acquisition of ICL ability.

Figure: In Xie et al.'s framework [1], the LM uses the training examples to internally figure out that the task is sentiment analysis (left) or topic classification (right) and applies the same mapping to the test input.

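To make the setup in the figure concrete, here is a minimal sketch in Python of how such a few-shot prompt can be assembled; the `build_icl_prompt` helper, the example sentences, and the use of "//" as the separator are illustrative assumptions rather than a fixed API.

```python
def build_icl_prompt(demonstrations, test_input, sep=" // "):
    """Concatenate (input, output) demonstration pairs and a test input
    into a single prompt; `sep` marks where the prediction goes."""
    lines = [f"{x}{sep}{y}" for x, y in demonstrations]
    lines.append(f"{test_input}{sep}")  # the LM completes after the separator
    return "\n".join(lines)

# Three demonstration pairs for sentiment classification, as in the figure.
demos = [
    ("Circulation revenue has increased by 5% in Finland.", "positive"),
    ("Panostaja did not disclose the purchase price.", "neutral"),
    ("Paying off the national debt will be extremely painful.", "negative"),
]
print(build_icl_prompt(demos, "The acquisition will boost earnings next year."))
```

Sampling a continuation of this prompt from an autoregressive LM then yields the prediction for the test input; no parameters are updated.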

1.2 - Conclusion

1.2.1 - Pivotal elements influencing the model’s ICL performance during inference