https://github.com/githubpradeep/notebooks/blob/main/rl final.py

https://github.com/githubpradeep/notebooks/blob/main/reasoning_demo.ipynb

https://unsloth.ai/blog/r1-reasoning (Unsloth uses GRPO)

This is an in-depth explanation of the original article from Unsloth on applying reasoning to LLMs.

Section 1: Introduction

Hello, everyone! Today, we’re going to talk about how we can teach a large language model—like our student—to answer questions not only correctly but also in a clear and well-organized way. Think of it like teaching a bright student who needs both accuracy and neatness in their work. We want our student (the language model) to earn gold stars every time it gives us the right answer and explains its thought process nicely.

A Quick Overview of How Language Models Work

Before we dive into our teaching method, let’s briefly review what a large language model (LLM) is and how it works:

- An LLM is a neural network trained on enormous amounts of text to predict the next token (a word or piece of a word) given everything that came before.
- To answer a question, it generates a response one token at a time, repeatedly sampling from the probability distribution it assigns to possible next tokens.
- Out of the box, it has no notion of whether an answer is right or wrong; it only knows which continuations are statistically likely. That is exactly the gap our training method needs to fill.
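To make this concrete, here is a minimal sketch of next-token prediction using the Hugging Face transformers library. The choice of gpt2 is just an illustrative small model, not the one used in the linked notebooks:

```python
# Minimal next-token prediction demo. "gpt2" is an illustrative small model;
# any causal LM works the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# One forward pass yields a probability distribution over the next token.
with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, seq_len, vocab_size)
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

# Show the model's five most likely continuations.
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}  p={prob.item():.3f}")
```

Generation is just this step in a loop: sample (or pick) a token, append it to the input, and predict again.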

What Is REINFORCE and Why Do We Use It?

Now, let’s introduce an important concept: REINFORCE. Imagine you’re teaching our student and, each time they answer a question, you decide whether to give them a gold star based on how well they did. In our case, the gold stars represent numerical rewards that tell the model, "Good job! Keep doing that!" Here’s how it works:

1. The model generates an answer by sampling tokens from its output distribution, just as it does normally.
2. We score that answer with a numerical reward: high for a good answer, low (or zero) for a poor one.
3. We adjust the model’s parameters to make highly rewarded answers more likely in the future, scaling each update by the size of the reward.

That last step is the heart of REINFORCE: increase the log-probability of the outputs you sampled, in proportion to the reward they earned.
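Here is a tiny, self-contained REINFORCE sketch in PyTorch. The four-way categorical "policy" and the reward_fn are hypothetical stand-ins for a full language model and a real answer-scoring function, but the update rule is the same reward-weighted log-probability trick described above:

```python
# Minimal REINFORCE loop. The policy is a toy categorical distribution over
# 4 "answers"; reward_fn is a hypothetical stand-in for scoring a completion.
import torch

logits = torch.zeros(4, requires_grad=True)       # toy policy parameters
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_fn(action: int) -> float:
    return 1.0 if action == 2 else 0.0            # pretend answer 2 is "correct"

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                        # the model "answers"
    reward = reward_fn(action.item())             # gold star or not

    # REINFORCE: raise log-probability of the sampled action, scaled by reward.
    loss = -reward * dist.log_prob(action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Probability mass should now concentrate on the rewarded answer (index 2).
print(torch.softmax(logits, dim=-1))
```

With a real LLM, the "action" is an entire sampled completion and the log-probability is the sum of per-token log-probabilities, but the principle is unchanged.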

Our Specific Goals

So, what are we trying to achieve here? Concretely, two things:

- Correctness: the model should arrive at the right final answer.
- Clarity: the model should present its reasoning in a consistent, well-organized format, so we can follow how it got there.

Each of these goals can be turned into its own reward signal, and together they make up the "gold stars" our REINFORCE-style training hands out.
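As a sketch of how those two goals become rewards, here are two illustrative reward functions. The <reasoning>/<answer> tag format and the point values are assumptions made for this example, not necessarily the exact scheme used in the linked notebooks:

```python
# Illustrative reward functions for the two goals above. The tag format and
# point values are assumptions for this sketch, not a definitive scheme.
import re

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Gold star for extracting the right final answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == gold_answer.strip():
        return 2.0
    return 0.0

def format_reward(completion: str) -> float:
    """Smaller reward just for presenting the work in the expected structure."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0

completion = "<reasoning>2 + 2 = 4</reasoning><answer>4</answer>"
print(correctness_reward(completion, "4"), format_reward(completion))
```

Splitting the reward this way lets the model earn partial credit for showing its work in the right shape even before its answers are reliably correct.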