👋 Welcome to Wei Shen’s LLM Blog

Email: [email protected]

Google Scholar: https://scholar.google.com/citations?hl=en&user=M_bRlb8AAAAJ&view_op=list_works&gmla=AOAOcb2h_4hMB8KGhn_LHG3Ziuv9biv3vnCG6dD5CdXp5jOjpLQdAx-3I7jviY-MlmIkCObgD_OSl_qxmSdFw6-p

Foundational Mechanics of Large Language Models

In-Context Learning

Exploring the Potential of In-Context Learning: New Pathways for Enhancing Chat-Based Large Language Model Performance

Authors: Wei Shen, Chuheng Zhang, Liang Zeng

TL;DR: In-Context Learning (ICL) allows Large Language Models (LLMs) to learn directly from a small set of examples provided in their context, generalizing to new tasks without any explicit gradient updates. Recognized as an emergent capability of LLMs, ICL has drawn sustained research interest aimed at uncovering its underlying mechanisms. At the same time, the pre-training and alignment pipelines behind models such as ChatGPT and GPT-4 have attracted considerable attention, with many studies examining how LLMs perform in these phases, particularly when deployed as chatbots for end users. This raises a natural question: how do these paradigms interact, and how can ICL be used to enhance chat model performance?
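
For readers new to the idea, here is a minimal sketch of in-context learning. The task, reviews, and labels below are invented purely for illustration, and any chat- or completion-style LLM is assumed as the backend.

```python
# A minimal sketch of in-context learning (ICL): the model is expected to infer
# the task (sentiment labeling) purely from the examples embedded in the prompt,
# with no gradient updates. The reviews and labels are made up for illustration.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and the screen is gorgeous."
Sentiment: Positive

Review: "It stopped working after a week and support never replied."
Sentiment: Negative

Review: "Setup took five minutes and everything just worked."
Sentiment:"""

# Sending `few_shot_prompt` to a chat-based LLM should yield "Positive",
# even though the model was never fine-tuned on this specific labeling task.
print(few_shot_prompt)
```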


Alignment Strategy

Reinforcement Learning From Human Feedback

System, Mathematics, and Code in TRL PPO

Authors: Yunhui Xia, Wei Shen

TL;DR: TRL is a full-stack library that provides tools for training transformer language models with Reinforcement Learning, covering the Supervised Fine-Tuning (SFT), Reward Modeling (RM), and Proximal Policy Optimization (PPO) steps. The library is integrated with 🤗 Transformers. In this blog, we introduce the system architecture, mathematics, and code of PPO in TRL. Specifically, we split the blog into three parts: 1) an introduction to the TRL PPO system architecture, 2) the mathematics of the PPO algorithm, and 3) the code in the PPO Trainer.
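
As a quick orientation before the architecture deep dive, the sketch below follows the pattern of TRL's PPO quickstart: generate a response with the policy, score it, and pass the (query, response, reward) triplet to a single PPO optimization step. The model name, prompt text, and constant reward are placeholders, and the exact API surface may differ across TRL versions.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# 1. Load the policy (with a value head), a frozen reference model for the
#    KL penalty, and the tokenizer.
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# 2. Build the PPO trainer.
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# 3. Encode a query and let the policy generate a response.
query_tensor = tokenizer.encode("This morning I went to the ", return_tensors="pt")
generation_kwargs = {"do_sample": True, "max_new_tokens": 20,
                     "pad_token_id": tokenizer.eos_token_id}
response_tensor = ppo_trainer.generate(list(query_tensor), return_prompt=False,
                                       **generation_kwargs)

# 4. Score the response. A real setup would query a trained reward model;
#    a constant reward keeps this sketch self-contained.
reward = [torch.tensor(1.0)]

# 5. Run one PPO optimization step on the (query, response, reward) triplet.
stats = ppo_trainer.step([q for q in query_tensor], response_tensor, reward)
```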


Challenges of Sample Inefficiency (CSI): Practical Limitations of Direct Preference Optimization Algorithm

Authors: Wei Shen

TL;DR: In this blog, we contrast Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) from a reinforcement learning perspective, highlighting a key shortcoming of DPO: sample inefficiency. Because it is trained on relatively few samples that are largely off-policy, DPO suffers from a state distribution shift problem. Furthermore, since DPO is built on the Bradley-Terry model, it tends to overfit simpler pairwise samples while neglecting more difficult ones. The interplay between the state distribution shift and the limitations of the Bradley-Terry formulation can reduce the likelihoods of both the positive and the negative samples.
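
To make the Bradley-Terry connection concrete, here is a sketch of the standard DPO objective computed from pre-aggregated log-probabilities. The function name and the choice of beta are ours for illustration, and the batching and masking details of a real training loop are omitted.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a response under the policy
    or the frozen reference model. The Bradley-Terry structure shows up as a
    sigmoid over the difference of the two implicit rewards.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -8.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-11.5, -8.0]), torch.tensor([-13.0, -9.2]))
print(loss)
```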


Advanced Tricks for Training Large Language Models with Proximal Policy Optimization