👋 Welcome to Wei Shen’s LLM Blog
Email: [email protected]
Authors: Wei Shen, Chuheng Zhang, Liang Zeng
TL;DR: In-Context Learning (ICL) enables Large Language Models (LLMs) to learn directly from a handful of examples provided in their context and to generalize to new tasks without any explicit gradient updates. Widely regarded as an emergent capability of LLMs, ICL has drawn considerable research effort toward uncovering its underlying mechanisms. Meanwhile, the pre-training and alignment processes of models such as ChatGPT and GPT-4 have also received substantial attention, with numerous studies examining LLM performance in these phases, particularly when the models are deployed as chatbots for end users. This raises questions about the interplay between the two paradigms and how ICL might enhance chat model performance.
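As a quick illustration of the idea (not code from the post): a few-shot prompt lets the model pick up a task purely from examples placed in its context. The task, examples, and query below are made up for demonstration.

```python
# Minimal illustration of in-context learning: the model infers the task
# (sentiment labeling) from the examples in the prompt alone, with no
# gradient update. The examples and query here are hypothetical.
examples = [
    ("The movie was fantastic.", "positive"),
    ("I regret buying this phone.", "negative"),
]
query = "The service was slow but the food was great."

prompt = "\n\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
prompt += f"\n\nReview: {query}\nSentiment:"
print(prompt)  # send this prompt to any LLM completion endpoint
```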
System, Mathematics and Code in TRL PPO
Authors: Yunhui Xia, Wei Shen
TL;DR: TRL is a full-stack library that provides a set of tools for training transformer language models with reinforcement learning, from Supervised Fine-Tuning (SFT) and Reward Modeling (RM) to Proximal Policy Optimization (PPO). The library is integrated with 🤗 transformers. In this blog, we introduce the system architecture, mathematics, and code of PPO in TRL. Specifically, we split the blog into three parts: 1) an introduction to the TRL PPO system architecture, 2) the mathematics of the PPO algorithm, and 3) the code of the PPO Trainer.
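As a taste of what the post covers, here is a minimal sketch following the classic TRL PPOTrainer quickstart: generate a response, assign a dummy scalar reward, and run a single PPO step. Exact class and method signatures vary across TRL versions, so treat this as illustrative rather than authoritative.

```python
import torch
from transformers import AutoTokenizer
from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

# Policy and frozen reference model, both wrapped with a value head for PPO.
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5)
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

# One rollout: generate a response to a query, score it, and take a PPO step.
query_tensor = tokenizer.encode("This morning I went to the ", return_tensors="pt")
response_tensor = ppo_trainer.generate(list(query_tensor), return_prompt=False,
                                       max_new_tokens=20)
reward = [torch.tensor(1.0)]  # dummy reward; normally produced by the reward model
stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
```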
Authors: Wei Shen
TL;DR: In this blog, we contrast Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) from a reinforcement learning perspective and highlight a key shortcoming of DPO: sample inefficiency. Because DPO is trained on a limited number of samples that are largely off-policy, it suffers from a state distribution shift problem. Furthermore, as a Bradley-Terry model, DPO tends to overfit simpler pairwise samples while neglecting more complex ones. The interplay between the state distribution shift and the limitations of the Bradley-Terry model can result in reduced likelihoods for both the positive (chosen) and negative (rejected) samples.
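For reference, the Bradley-Terry pairwise objective at the heart of DPO fits in a few lines. This is a generic sketch of the standard DPO loss (not code from the post), assuming each argument is a per-sequence log-probability summed over response tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: a Bradley-Terry objective over implicit rewards."""
    # Implicit rewards are the log-probability ratios against the reference policy.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Tiny usage example with dummy log-probabilities:
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-10.5]), torch.tensor([-11.0]))
```

Because the objective only compares log-probability ratios, the likelihoods of both the chosen and rejected responses can decrease together as long as their gap widens, which is the reduced-likelihood behavior described above.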
Advanced Tricks for Training Large Language Models with Proximal Policy Optimization