https://github.com/githubpradeep/notebooks/blob/main/rl final.py
https://github.com/githubpradeep/notebooks/blob/main/reasoning_demo.ipynb
https://unsloth.ai/blog/r1-reasoning (Unsloth uses GRPO)
Hello, everyone! Today, we’re going to talk about how we can teach a large language model—like our student—to answer questions not only correctly but also in a clear and well-organized way. Think of it like teaching a bright student who needs both accuracy and neatness in their work. We want our student (the language model) to earn gold stars every time it gives us the right answer and explains its thought process nicely.
Before we dive into our teaching method, let’s briefly review what a large language model (LLM) is and how it works:
Token Prediction:
Large language models read text as a series of tokens (which are like individual words or parts of words). Their main job is to predict the next token in a sequence. For example, if the model sees "The cat sat on the," it might predict "mat" next. This process is done one token at a time.
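To make this concrete, here is a minimal sketch of next-token prediction. It assumes the Hugging Face transformers library and the small gpt2 checkpoint purely for illustration; any causal language model works the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize the prompt into a sequence of token ids.
inputs = tokenizer("The cat sat on the", return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, vocab_size)

# The logits at the last position score every token in the vocabulary
# as a candidate for the *next* token.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))     # likely " mat" or similar
```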
Self-Attention:
These models use a technique called self-attention, which helps them decide which words in the sentence are important for predicting the next word. This mechanism lets the model understand the relationships between words—even if they are far apart.
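Here is a toy, single-head version of that attention computation in plain PyTorch. Real models use many attention heads, masking, and trained projection weights, so treat this as a sketch of the idea only.

```python
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 16                   # 5 tokens, 16-dim embeddings
x = torch.randn(seq_len, d_model)          # token embeddings

# Projections map each token embedding to query, key, and value vectors.
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Each token's query is compared with every token's key; softmax turns
# those similarity scores into attention weights over the whole sequence.
scores = Q @ K.T / d_model ** 0.5          # (seq_len, seq_len)
weights = F.softmax(scores, dim=-1)

# The output for each token is a weighted mix of all value vectors,
# so distant-but-relevant words can still influence the prediction.
attended = weights @ V                     # (seq_len, d_model)
```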
Pretraining and Fine-Tuning:
Initially, these models are pretrained on huge amounts of text. This gives them a general understanding of language. Later, they are fine-tuned on specific tasks (like answering math questions) to adjust their behavior for particular needs.
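As a rough sketch, a single supervised fine-tuning step might look like the following. The model name and the math example are assumptions for illustration, not the setup from the linked notebooks.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

example = "Question: What is 7 * 8? Answer: 56"
batch = tokenizer(example, return_tensors="pt")

# Passing labels makes the library shift them internally and compute the
# next-token cross-entropy loss for us.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```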
Generating Output:
When the model writes an answer, it produces a probability for each word in its vocabulary at every step. The word with the highest probability (or one sampled at random from that distribution) is selected, and the process repeats until the answer is complete.
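A minimal generation loop, under the same illustrative GPT-2 assumption as before, might look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The cat sat on the", return_tensors="pt").input_ids
for _ in range(10):                                    # generate 10 tokens
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]        # scores for the next token
    probs = torch.softmax(logits, dim=-1)              # probability per word
    next_id = torch.multinomial(probs, num_samples=1)  # sample ("some randomness")
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```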
Now, let’s introduce an important concept: REINFORCE. Imagine you're teaching our student and, each time they answer a question, you decide whether to give them a gold star based on how well they did. In our case, the gold stars represent numerical rewards that tell the model, "Good job! Keep doing that!" Here's how it works:
Reinforcement Learning Basics:
In reinforcement learning, an agent (our model) learns by taking actions and receiving rewards. The goal is to maximize the total reward over time. For our student, every correct or well-organized answer is a good action that earns a reward.
The REINFORCE Algorithm:
REINFORCE is one of the simplest ways to achieve this. With REINFORCE, after the model generates an answer, we evaluate it using a set of reward functions. For example, we might give one reward when the final answer is correct and another when the reasoning is laid out in a clear, well-organized format.
The model then uses these rewards to adjust its internal settings (its parameters), so in the future it is more likely to generate answers that earn lots of gold stars.
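Putting the pieces together, here is a hedged sketch of a single REINFORCE update for one generated answer. The reward functions, model, and hyperparameters below are illustrative assumptions, not the exact ones from the linked notebooks, and a real setup would subtract a baseline and average over many samples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def correctness_reward(answer: str, gold: str) -> float:
    return 1.0 if gold in answer else 0.0           # gold star for the right answer

def format_reward(answer: str) -> float:
    return 0.5 if "Answer:" in answer else 0.0      # gold star for tidy presentation

prompt = "Question: What is 7 * 8?"
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample an answer from the current model.
gen_ids = model.generate(prompt_ids, max_new_tokens=20, do_sample=True)
answer = tokenizer.decode(gen_ids[0, prompt_ids.shape[1]:])

reward = correctness_reward(answer, "56") + format_reward(answer)

# Log-probabilities the current model assigns to the tokens it generated.
logits = model(gen_ids).logits[:, :-1]                      # position t predicts token t+1
log_probs = torch.log_softmax(logits, dim=-1)
token_logps = log_probs.gather(-1, gen_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
gen_logps = token_logps[:, prompt_ids.shape[1] - 1:]        # keep only the generated part

# REINFORCE: make the sampled answer more likely in proportion to its reward.
loss = -(reward * gen_logps.sum())
loss.backward()
optimizer.step()
optimizer.zero_grad()
```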
So, what are we trying to achieve here?