https://github.com/githubpradeep/notebooks/blob/main/rl final.py
https://github.com/githubpradeep/notebooks/blob/main/reasoning_demo.ipynb
https://unsloth.ai/blog/r1-reasoning (Unsloth uses GRPO)
Hello, everyone! Today, we’re going to talk about how we can teach a large language model—like our student—to answer questions not only correctly but also in a clear and well-organized way. Think of it like teaching a bright student who needs both accuracy and neatness in their work. We want our student (the language model) to earn gold stars every time it gives us the right answer and explains its thought process nicely.
Before we dive into our teaching method, let’s briefly review what a large language model (LLM) is and how it works:
Token Prediction:
Large language models read text as a series of tokens (which are like individual words or parts of words). Their main job is to predict the next token in a sequence. For example, if the model sees "The cat sat on the," it might predict "mat" next. This process is done one token at a time.
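To make this concrete, here is a minimal sketch of next-token prediction. It assumes the Hugging Face transformers library and the small gpt2 checkpoint purely for illustration; any causal language model works the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the prompt into a sequence of token ids.
inputs = tokenizer("The cat sat on the", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, vocab_size)

# The logits at the last position score every token in the vocabulary
# as a candidate for the *next* token.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))     # likely " mat" or similar
```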
Self-Attention:
These models use a technique called self-attention, which helps them decide which words in the sentence are important for predicting the next word. This mechanism lets the model understand the relationships between words—even if they are far apart.
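Here is a toy, single-head version of that attention computation in plain PyTorch. Real models use many attention heads, masking, and trained projection weights, so treat this as a sketch of the idea only.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16                   # 5 tokens, 16-dim embeddings
x = torch.randn(seq_len, d_model)          # token embeddings

# Projections map each token embedding to query, key, and value vectors.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each token's query is compared with every token's key; softmax turns
# those similarity scores into attention weights over the whole sequence.
scores = Q @ K.T / d_model ** 0.5          # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)

# The output for each token is a weighted mix of all value vectors,
# so distant-but-relevant words can still influence the prediction.
attended = weights @ V                     # (seq_len, d_model)
```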
Pretraining and Fine-Tuning:
Initially, these models are pretrained on huge amounts of text. This gives them a general understanding of language. Later, they are fine-tuned on specific tasks (like answering math questions) to adjust their behavior for particular needs.
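As a rough sketch, a single supervised fine-tuning step might look like the following. The model name and the math example are assumptions for illustration, not the setup from the linked notebooks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

example = "Question: What is 7 * 8? Answer: 56"
batch = tokenizer(example, return_tensors="pt")

# Passing labels makes the library shift them internally and compute the
# next-token cross-entropy loss for us.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```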
Generating Output:
When the model writes an answer, it produces a probability for each word in its vocabulary at every step. The word with the highest probability (or one sampled at random from that distribution) is selected, and the process repeats until the answer is complete.
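A minimal generation loop, under the same illustrative GPT-2 assumption as before, might look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
for _ in range(10):                                    # generate 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]        # scores for the next token
    probs = torch.softmax(logits, dim=-1)              # probability per word
    next_id = torch.multinomial(probs, num_samples=1)  # sample ("some randomness")
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```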
Now, let’s introduce an important concept: REINFORCE. Imagine you're teaching our student and, each time they answer a question, you decide whether to give them a gold star based on how well they did. In our case, the gold stars represent numerical rewards that tell the model, "Good job! Keep doing that!" Here's how it works:
Reinforcement Learning Basics:
In reinforcement learning, an agent (our model) learns by taking actions and receiving rewards. The goal is to maximize the total reward over time. For our student, every correct or well-organized answer is a good action that earns a reward.
The REINFORCE Algorithm:
REINFORCE is one of the simplest ways to achieve this. With REINFORCE, after the model generates an answer, we evaluate it using a set of reward functions. For example, we might give one reward when the final answer is correct and another when the reasoning is laid out in a clear, well-organized format.
The model then uses these rewards to adjust its internal settings (its parameters), so in the future it is more likely to generate answers that earn lots of gold stars.
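Putting the pieces together, here is a hedged sketch of a single REINFORCE update for one generated answer. The reward functions, model, and hyperparameters below are illustrative assumptions, not the exact ones from the linked notebooks, and a real setup would subtract a baseline and average over many samples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def correctness_reward(answer: str, gold: str) -> float:
    return 1.0 if gold in answer else 0.0           # gold star for the right answer

def format_reward(answer: str) -> float:
    return 0.5 if "Answer:" in answer else 0.0      # gold star for tidy presentation

prompt = "Question: What is 7 * 8?"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample an answer from the current model.
gen_ids = model.generate(prompt_ids, max_new_tokens=20, do_sample=True)
answer = tokenizer.decode(gen_ids[0, prompt_ids.shape[1]:])

reward = correctness_reward(answer, "56") + format_reward(answer)

# Log-probabilities the current model assigns to the tokens it generated.
logits = model(gen_ids).logits[:, :-1]                      # position t predicts token t+1
log_probs = torch.log_softmax(logits, dim=-1)
token_logps = log_probs.gather(-1, gen_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
gen_logps = token_logps[:, prompt_ids.shape[1] - 1:]        # keep only the generated part

# REINFORCE: make the sampled answer more likely in proportion to its reward.
loss = -(reward * gen_logps.sum())
loss.backward()
optimizer.step()
optimizer.zero_grad()
```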
So, what are we trying to achieve here?