Over the past few years, Reinforcement Learning has become the driving force behind the dramatic leaps in AI research. From OpenAI's RLHF to DeepSeek's reasoning optimization and Qwen's advanced training pipelines, the story of RL for language models is a story of stability, scalability, and signal precision.
While this blog is about GSPO, we have to take a step back to analyze its big bros: PPO and GRPO.
First, let’s talk about an important concept in RL: Importance Sampling.
Importance Sampling
When training a Large Language Model with RL, we often face this problem: we want to estimate how good a new policy is, but our samples were generated by an old policy. So how do we make sure updates from the old samples still point in the right direction? That’s where Importance Sampling comes in.
Suppose you want to estimate the expected value of a function \(f(z)\) under a target distribution \(\pi_{\text{tar}}\), but you can only sample from a different behaviour distribution \(\pi_{\text{beh}}\). Importance Sampling provides a way to estimate that expected value by reweighting the samples drawn from \(\pi_{\text{beh}}\):
\[ \mathbb{E}_{z \sim \pi_{\text{tar}}}[f(z)] = \mathbb{E}_{z \sim \pi_{\text{beh}}}\left[ \frac{\pi_{\text{tar}}(z)}{\pi_{\text{beh}}(z)} f(z) \right] \]
That fraction \(\frac{\pi_{\text{tar}}(z)}{\pi_{\text{beh}}(z)}\) is called the importance weight. It acts like a translator: it adjusts old samples so they represent what the new policy would have done. In RL with Language Models:
- The old policy is the model before the update \(\pi_{\theta_{\text{old}}}\)
- The new policy is the model we’re training \(\pi_{\theta}\)
- A sample \(z\) is a generated response \(y\), or sometimes a single token \(y_t\), so the importance weight becomes \(\frac{\pi_{\theta}(z)}{\pi_{\theta_{\text{old}}}(z)}\) (a toy numerical sketch follows below).
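To make this concrete, here is a toy sketch (not from any particular library; the distributions and rewards are made-up placeholders) that estimates an expectation under a target policy using only samples drawn from a behaviour policy:

```python
# Toy importance-sampling estimate: evaluate E_tar[f] using only behaviour-policy samples.
import torch

torch.manual_seed(0)

# Two categorical distributions over the same 5 "actions" (stand-ins for tokens).
behaviour_probs = torch.tensor([0.30, 0.30, 0.20, 0.10, 0.10])
target_probs    = torch.tensor([0.10, 0.20, 0.20, 0.25, 0.25])
rewards         = torch.tensor([1.0, 0.5, 0.0, 2.0, 3.0])  # f(z) for each action

# Sample from the behaviour distribution only.
samples = torch.multinomial(behaviour_probs, num_samples=50_000, replacement=True)

# Importance weights w(z) = pi_tar(z) / pi_beh(z) reweight each old sample.
weights = target_probs[samples] / behaviour_probs[samples]
is_estimate = (weights * rewards[samples]).mean()

true_value = (target_probs * rewards).sum()
print(f"IS estimate: {is_estimate.item():.3f}  vs  true E_tar[f]: {true_value.item():.3f}")
```

The reweighted average converges to the target expectation even though not a single sample came from the target distribution.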
That being said, let’s dive into our main card.
PPO: Proximal Policy Optimization
In PPO, the model (the policy) generates responses, receives a scalar reward signal \(r\) from the reward model, and updates itself by comparing the probability of the new policy to that of the old one. At its core, PPO is built around a simple but profound idea: learn from past experiences but don’t stray too far from what worked before. When training with RL, we collect data: prompts and responses generated by the old policy \(\pi_{\theta_{\text{old}}}\) (think of it like an older version of the model). However, when we update the model’s parameters to a new version \(\pi_{\theta}\), its probability distribution over responses changes. This creates a mismatch: our training samples came from one policy but we’re now optimizing another one.
PPO addresses this mismatch by weighting each sample according to how much more likely it is under the new policy compared to the old one. For every generated token, PPO computes an importance ratio \(w_t(\theta) = \frac{\pi_\theta(y_t|x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t|x, y_{<t})}\)
This ratio acts as a “trust coefficient”.
- If the new model \(\pi_\theta\) assigns a much higher probability (\(w_t \gg 1\)), the sample may be overemphasized, leading to unstable updates.
- If the new model \(\pi_\theta\) assigns a much lower probability (\(w_t \ll 1\)), the sample may be underemphasized, leading to suboptimal updates.
This leads to PPO's clipped objective:
\[ \mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \min\Big( w_t(\theta)\,\hat{A}_t,\ \text{clip}\big(w_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right] \]
- \(\hat{A}_t\) is the estimated advantage of the \(t\)-th token: it measures how much better that token is than average. The advantage is computed via Generalized Advantage Estimation (GAE), which involves a value model.
- \(\text{clip}(w_t(\theta), 1 - \epsilon, 1 + \epsilon)\) is the clipping term that keeps updates stable. Clipping is what makes PPO proximal: each update stays close to the previous one, ensuring stable learning while still allowing gradual improvement (see the sketch after this list).
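Here is a minimal PyTorch sketch of that clipped objective. The tensor names and shapes are illustrative assumptions, not any particular implementation, and the log-probs and GAE advantages are assumed to be precomputed:

```python
import torch

def ppo_clipped_loss(new_logp, old_logp, advantages, eps=0.2):
    """new_logp / old_logp / advantages: tensors of shape [batch, seq_len], one entry per token."""
    # Importance ratio w_t(theta) = pi_theta / pi_theta_old, computed in log-space for stability.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped   = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (minimum) surrogate and negate it to get a loss to minimize.
    return -torch.min(unclipped, clipped).mean()

# Example usage with random placeholder values.
new_logp = torch.randn(4, 16, requires_grad=True)
old_logp = torch.randn(4, 16)
adv      = torch.randn(4, 16)
loss = ppo_clipped_loss(new_logp, old_logp, adv)
loss.backward()
```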
Despite its elegance, PPO has a major limitation: it needs a value model roughly as large as the policy to compute the advantage, introducing a considerable memory and computational burden.
GRPO: Group Relative Policy Optimization
Now that we have a good understanding of PPO, it’s super easy to understand GRPO. GRPO keeps the spirit of PPO’s importance sampling but removes the value model entirely.
For each prompt \(x\), the model generates a group of \(G\) responses \(\{y_1, \dots, y_G\}\). Each response \(y_i\) receives a scalar reward \(r(x, y_i)\) from a verifier (a reward model or an external signal). Then, instead of using a learned value baseline, GRPO computes a relative advantage for each response by normalizing its reward within the group:
\[ \hat{A}_i = \frac{r(x, y_i) - \text{mean}\big(\{r(x, y_j)\}_{j=1}^{G}\big)}{\text{std}\big(\{r(x, y_j)\}_{j=1}^{G}\big)} \]
Once the relative advantages are computed, GRPO applies a PPO-like update for each token \(y_{i,t}\) in response \(y_i\).
The objective becomes:
\[ \mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\Big( w_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\big(w_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \Big) \right] \]
where \(w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}\) and every token in response \(y_i\) inherits the same advantage, \(\hat{A}_{i,t} = \hat{A}_i\).
This makes GRPO computationally elegant, lightweight, and surprisingly effective for tasks like mathematical reasoning, coding, and instruction following. The problem with GRPO is how its importance sampling is applied. The importance weights \(w_{i,t}\) are meant to correct the difference between the old and new policy. However, GRPO applies these corrections at the token level while the reward is given at the sequence level. This breaks the theoretical foundation of importance sampling. The result is high-variance gradients.
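As a sketch of how the group-relative advantage and the token-level weighting fit together, here is a hypothetical helper for a single prompt (not the DeepSeek or Qwen code; names and tensor shapes are assumptions):

```python
import torch

def grpo_loss(new_logp, old_logp, rewards, mask, eps=0.2):
    """new_logp / old_logp / mask: [G, seq_len] (mask is 1.0 for real tokens, 0.0 for padding);
    rewards: [G], one scalar reward per response in the group."""
    # Group-relative advantage: normalize each response's reward within its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # [G]
    adv = adv.unsqueeze(1).expand_as(new_logp)                     # broadcast to every token

    # Token-level importance ratios, exactly as in PPO -- this is the part GSPO later changes.
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * adv
    clipped   = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    per_token = torch.min(unclipped, clipped) * mask               # ignore padding tokens
    return -(per_token.sum(dim=1) / mask.sum(dim=1)).mean()
```

Note how the advantage is computed once per sequence but the ratio is applied token by token; that mismatch is exactly the issue described above.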
GSPO: Group Sequence Policy Optimization
GSPO and GRPO share the same workflow; the only, and highly important, difference is the way importance sampling is applied. Instead of token-level ratios, GSPO uses the geometric mean of the per-token ratios, thus redefining the importance ratio at the sequence level:
\[ s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{\frac{1}{|y_i|}} = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right) \]
Using this sequence-level importance ratio, the objective becomes:
\[ \mathcal{J}_{\text{GSPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\Big( s_i(\theta)\,\hat{A}_i,\ \text{clip}\big(s_i(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \right] \]
Just like in GRPO:
- \(G\) is the number of responses generated for each prompt \(x\).
- \(\hat{A}_i\) is the normalized advantage of the \(i\)-th response.
Every token in a response shares the same scaling factor \(s_i(\theta)\). This means the gradient contributions from all tokens move in a consistent direction. No more noisy tug-of-war between tokens in the same sentence.
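Here is a minimal sketch of the sequence-level ratio and the resulting loss, under the same assumed tensor layout as before (the clipping value is illustrative only; the appropriate range for a sequence-level ratio is much tighter than PPO's 0.2):

```python
import torch

def gspo_loss(new_logp, old_logp, rewards, mask, eps=3e-3):
    """new_logp / old_logp / mask: [G, seq_len]; rewards: [G]."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)      # [G]

    # s_i(theta) = exp( mean_t log(pi_theta / pi_theta_old) ): one ratio per sequence,
    # i.e. the geometric mean of the per-token ratios.
    lengths   = mask.sum(dim=1)
    log_ratio = ((new_logp - old_logp) * mask).sum(dim=1) / lengths
    seq_ratio = torch.exp(log_ratio)                               # [G]

    unclipped = seq_ratio * adv
    clipped   = torch.clamp(seq_ratio, 1 - eps, 1 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```

Because every token in \(y_i\) is scaled by the same \(s_i(\theta)\), the per-token gradients no longer pull in conflicting directions.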
The results from the tests made by the Qwen Team are impressive:
- Higher Sample Efficiency: It achieved better benchmark scores with the same number of queries.
- MoE stability: Mixture-of-Experts models trained stably without any Routing Replay.
Since the importance ratio is sequence-based, it is unaffected by which experts are activated for a given token; the MoE routing noise simply averages out at the sequence level.
Sometimes tasks demand finer granularity, for example in multi-turn reasoning or step-wise reward setups. To address this, there is GSPO-token, a variant that keeps GSPO's stable sequence-level ratio but allows token-specific advantages. It is mathematically equivalent to GSPO when all token advantages are equal, but more flexible for tasks that need local credit assignment.
Its objective takes the form
\[ \mathcal{J}_{\text{GSPO-token}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min\Big( s_{i,t}(\theta)\,\hat{A}_{i,t},\ \text{clip}\big(s_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \Big) \right] \]
where
\[ s_{i,t}(\theta) = \text{sg}\big[s_i(\theta)\big] \cdot \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\text{sg}\big[\pi_\theta(y_{i,t} \mid x, y_{i,<t})\big]} \]
and \(\text{sg}[\cdot]\) denotes the stop-gradient operation: numerically, \(s_{i,t}(\theta)\) equals \(s_i(\theta)\), but its gradient flows only through the token's own log-probability.
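To see how the stop-gradient plays out in practice, here is a hypothetical PyTorch helper computing \(s_{i,t}(\theta)\) (the names and shapes are assumptions, not an official implementation):

```python
import torch

def gspo_token_ratios(new_logp, old_logp, mask):
    """new_logp / old_logp / mask: [G, seq_len]. Returns s_{i,t}(theta) with the same shape."""
    lengths       = mask.sum(dim=1, keepdim=True)
    log_seq_ratio = ((new_logp - old_logp) * mask).sum(dim=1, keepdim=True) / lengths
    seq_ratio     = torch.exp(log_seq_ratio)                       # s_i(theta), shape [G, 1]

    # sg[s_i] * pi_theta(y_t | ...) / sg[pi_theta(y_t | ...)]: detach() is the stop-gradient,
    # so the value of each entry stays s_i while the gradient flows only through that token.
    return seq_ratio.detach() * torch.exp(new_logp - new_logp.detach())
```

The returned tensor can then be plugged into the usual clipped surrogate with per-token advantages, recovering GSPO exactly when all advantages within a response are equal.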