- RLHF[1] is the pivotal technology that transformed ChatGPT from merely "being able to talk" to "talking well," and forms the core methodology of InstructGPT — training a reward model from human feedback and then optimizing the language model policy with PPO[4]
- DPO[3] skips Reward Model training entirely, optimizing the policy directly from preference data and dramatically reducing alignment costs — Zephyr[9], a 7B model trained with DPO, outperformed a 70B RLHF model
- DeepSeek-R1[7]'s GRPO[11] method demonstrated that pure RL — without any human annotations — can elicit reasoning capabilities, producing an "aha moment" where the model spontaneously learns to reflect and verify
- Alignment techniques are shifting from a "human-annotation-driven" paradigm toward "self-rewarding"[12] and "group reinforcement" — this article includes two hands-on Google Colab labs: DPO fine-tuning and Reward Model training
1. Why Do LLMs Need Alignment? The Critical Turning Point from GPT-3 to ChatGPT
GPT-3, with its 175 billion parameters, could generate fluent text, yet it frequently "refused to cooperate" — ask it a simple question, and it might produce a Wikipedia-style encyclopedia entry; ask it to write code, and it might output something that looks correct but contains logical errors; worse still, it would generate harmful or biased content without hesitation.
The root cause lies in a fundamental gap: pretraining only teaches the model to "predict the next token," but never teaches it what constitutes a "good answer."
Pretraining objective: max Σ log P(x_t | x_1, ..., x_{t-1})
→ Learns statistical patterns of language, but not "what makes a useful answer"
Alignment objective: max E_{x~prompt}[R(x, y)] - β·KL[π_θ(y|x) || π_ref(y|x)]
→ Maximizes human preferences while preserving language capabilities
Pretraining ≠ Useful:
User: "What is the capital of France?"
Unaligned: "What is the capital of France? This is a geography question. France is located in Western Europe..." (continuation)
Aligned: "The capital of France is Paris." (answers the question)
In 2022, OpenAI published InstructGPT[1], demonstrating a remarkable result: a 1.3B model trained with RLHF outperformed the unaligned 175B GPT-3 in human evaluations. This meant alignment not only did not impair model capabilities but actually unlocked knowledge already learned during pretraining — the so-called Alignment Bonus.
The success of InstructGPT directly gave rise to ChatGPT, and RLHF became the standard training pipeline for large language models. However, the complexity and cost of RLHF also spurred the exploration of simpler methods — from DPO to GRPO, alignment techniques entered an era of rapid innovation.
2. The Complete RLHF Pipeline: SFT → Reward Model → PPO
The core idea of RLHF (Reinforcement Learning from Human Feedback) originates from the pioneering work of Christiano et al.[2] in robotic control: humans cannot write precise reward functions, but they can easily compare the quality of two outcomes. InstructGPT[1] systematically applied this idea to language models, establishing a three-stage training pipeline.
Stage 1: Supervised Fine-Tuning (SFT)
Starting from the pretrained model, supervised fine-tuning is performed using high-quality instruction-response pairs written by human annotators. InstructGPT used approximately 13,000 human-written demonstration examples.
SFT Loss Function:
L_SFT = -Σ log P_θ(y_t | x, y_1, ..., y_{t-1})
x: instruction (prompt)
y: human-annotated ideal response
θ: model parameters
Role of SFT:
Pretrained model → Learns "conversational format" and "instruction following"
But SFT data is limited, so the model may still produce inappropriate responses
→ Further optimization with RL is needed
Stage 2: Reward Model Training
The reward model is the core component of RLHF. It learns human preference judgments: for the same prompt, human annotators rank multiple responses by quality, and the reward model learns to predict this ranking[5].
Bradley-Terry Preference Model:
P(y_w ≻ y_l | x) = σ(r_φ(x, y_w) - r_φ(x, y_l))
y_w: human-preferred response (winner)
y_l: human-dispreferred response (loser)
r_φ: reward model, outputs a scalar score
σ: sigmoid function
RM Training Loss:
L_RM = -E_{(x, y_w, y_l) ~ D}[log σ(r_φ(x, y_w) - r_φ(x, y_l))]
→ Maximizes the reward gap between preferred and dispreferred responses
InstructGPT RM Training:
- 33,000 prompts, each with 4-9 responses
- Annotators provided complete rankings (not just pairwise comparisons)
- Each ranking produces C(K,2) preference pairs, greatly improving data efficiency
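To make the pair construction and loss concrete, here is a minimal sketch (illustrative helper names, not InstructGPT's actual code) of expanding one ranking into C(K,2) pairs and computing the Bradley-Terry RM loss on toy scalar rewards:

```python
import itertools

import torch
import torch.nn.functional as F

def ranking_to_pairs(ranked_responses):
    """Expand one full ranking (best to worst) into all C(K,2) preference pairs."""
    return list(itertools.combinations(ranked_responses, 2))

def rm_loss(reward_chosen, reward_rejected):
    """Bradley-Terry RM loss: -log σ(r_φ(x, y_w) - r_φ(x, y_l))."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# One ranking of K=4 responses yields C(4,2) = 6 training pairs
pairs = ranking_to_pairs(["best", "good", "fair", "poor"])
print(len(pairs))  # 6

# A well-separated pair gives a small loss; an inverted pair a large one
print(rm_loss(torch.tensor([2.0]), torch.tensor([-2.0])).item())  # ≈ 0.018
print(rm_loss(torch.tensor([-2.0]), torch.tensor([2.0])).item())  # ≈ 4.018
```

Note how a single annotator session ranking four responses produces six gradient signals, which is exactly the data-efficiency gain described above.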
The quality of the reward model directly determines the upper bound of RLHF. If the reward model learns incorrect preferences (e.g., favoring verbose responses), the entire RLHF training will optimize in the wrong direction — this is known as Reward Hacking. RewardBench[13] provides a systematic benchmark for evaluating reward models.
Stage 3: PPO Reinforcement Learning Optimization
With a reward model in hand, we can use reinforcement learning to optimize the language model. PPO (Proximal Policy Optimization)[4] is currently the most widely used algorithm, as it strikes a good balance between stability and efficiency.
RLHF RL Objective Function:
max_{π_θ} E_{x~D, y~π_θ(·|x)}[r_φ(x, y)] - β·KL[π_θ(y|x) || π_ref(y|x)]
π_θ: current policy (the language model being trained)
π_ref: reference policy (frozen model after SFT)
r_φ: reward model score
β: KL penalty coefficient (controls deviation from the SFT model)
Role of KL Divergence Constraint:
- Prevents the model from generating unnatural text to chase high rewards
- Maintains language fluency and diversity
- Avoids Reward Hacking (exploiting reward model vulnerabilities)
PPO Clipped Objective:
L_PPO = E[min(r_t(θ)·A_t, clip(r_t(θ), 1-ε, 1+ε)·A_t)]
r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) (policy ratio)
A_t: advantage function
ε: clipping range (typically 0.1-0.2)
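The clipped objective above can be sketched in a few lines of PyTorch (a toy illustration on flat tensors, not TRL's production PPO implementation):

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate: E[min(r_t·A_t, clip(r_t, 1-ε, 1+ε)·A_t)], maximized."""
    ratio = torch.exp(logp_new - logp_old)                       # r_t(θ)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

# With positive advantages, ratios beyond 1+ε stop contributing extra gain:
logp_old = torch.zeros(3)
logp_new = torch.tensor([0.0, 0.5, 1.0])   # ratios ≈ 1.00, 1.65, 2.72
advantages = torch.ones(3)
print(ppo_clip_objective(logp_new, logp_old, advantages).item())  # ≈ 1.133
```

The clipping is what gives PPO its stability: once the policy ratio leaves the trust region [1-ε, 1+ε], the gradient incentive to push further vanishes.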
RLHF with PPO requires 4 models running simultaneously:
1. Actor (policy model): generates responses
2. Critic (value model): estimates state values
3. Reward Model: scores responses
4. Reference Model: computes KL penalty
→ Massive memory overhead — the primary engineering challenge of RLHF
Anthropic's research[6] further revealed an important property of RLHF: it can simultaneously optimize for "helpfulness" and "harmlessness," but tension exists between the two — over-optimizing for harmlessness makes the model conservative and unhelpful, while over-optimizing for helpfulness may produce harmful content. During the training of Llama 2[8], Meta used two independent reward models to separately optimize these two dimensions.
3. DPO: An Elegant Simplification That Skips the Reward Model
Although RLHF is effective, its engineering complexity is extremely high: it requires training a separate reward model, loading four models simultaneously, and tuning numerous PPO hyperparameters. In 2023, Rafailov et al.[3] proposed DPO (Direct Preference Optimization), mathematically proving a striking conclusion: your language model itself is an implicit reward model.
Mathematical Derivation from RLHF to DPO
DPO's derivation starts from the optimal solution of RLHF. The KL-constrained RL objective has a closed-form optimal policy:
RLHF KL-Constrained RL Problem:
max_{π} E[r(x,y)] - β·KL[π(y|x) || π_ref(y|x)]
Closed-Form Optimal Policy:
π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)
Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β) (partition function)
Solving for the Reward Function:
r(x,y) = β · log[π*(y|x) / π_ref(y|x)] + β · log Z(x)
Substituting into the Bradley-Terry Model:
P(y_w ≻ y_l) = σ(r(x,y_w) - r(x,y_l))
The partition function Z(x) cancels when subtracting the two rewards:
r(x,y_w) - r(x,y_l) = β · log[π_θ(y_w|x)/π_ref(y_w|x)]
- β · log[π_θ(y_l|x)/π_ref(y_l|x)]
DPO Loss Function:
L_DPO = -E_{(x,y_w,y_l)~D}[log σ(β · (log π_θ(y_w|x)/π_ref(y_w|x)
- log π_θ(y_l|x)/π_ref(y_l|x)))]
Intuitive Interpretation:
- Increase π_θ(y_w|x): make the model more likely to generate preferred responses
- Decrease π_θ(y_l|x): make the model less likely to generate dispreferred responses
- Ratio relative to π_ref: ensures it doesn't deviate too far
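The loss maps directly to code. Here is a minimal sketch working from precomputed sequence log-probabilities (TRL's `DPOTrainer`, used in Lab 1 below, implements the full version):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed log-probs of chosen (w) / rejected (l) responses."""
    logratio_w = policy_logp_w - ref_logp_w    # log π_θ(y_w|x)/π_ref(y_w|x)
    logratio_l = policy_logp_l - ref_logp_l
    return -F.logsigmoid(beta * (logratio_w - logratio_l)).mean()

# At initialization π_θ == π_ref, so both log-ratios are 0 and the loss is log 2
loss = dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
                torch.tensor([-10.0]), torch.tensor([-12.0]))
print(round(loss.item(), 4))  # 0.6931
```

Starting from log 2 and watching the loss fall below it is a quick sanity check that training is actually separating chosen from rejected responses.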
DPO vs RLHF: A Systematic Comparison
| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Training Stages | SFT → RM → PPO (three stages) | SFT → DPO (two stages) |
| Reward Model | Requires separate training | Not needed (implicit) |
| Memory Requirements | 4 models loaded simultaneously | 2 models (π_θ + π_ref) |
| Hyperparameters | PPO has numerous hyperparameters | Primarily just β |
| Training Stability | PPO training is unstable, prone to collapse | Stable, similar to supervised learning |
| Data Requirements | Online generation + offline preferences | Offline preference data only |
| Scalability | High engineering complexity | Simple engineering, easy to implement |
| Theoretical Guarantees | Optimal under ideal conditions | Mathematically equivalent (under same assumptions) |
| Practical Performance | Usually superior at large scale | Exceptional cost-effectiveness at small scale |
| Representative Cases | InstructGPT, ChatGPT, Llama 2 | Zephyr, Mixtral-Instruct |
Zephyr[9] is DPO's most striking success story: the HuggingFace team used DPO to train a 7B parameter model that outperformed Llama 2-Chat 70B (a model trained with full RLHF) on MT-Bench. This proved DPO's outstanding cost-effectiveness for small-to-medium scale scenarios.
IPO[15] (Identity Preference Optimization) further analyzed DPO's theoretical foundations, pointing out that DPO implicitly assumes the correctness of the Bradley-Terry model. When preference data does not conform to this assumption, IPO provides a more robust alternative.
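As a sketch of the difference: where DPO applies a logistic loss to the implicit-reward margin, IPO regresses the log-ratio gap toward a fixed target 1/(2τ), so it cannot be driven to infinity on deterministic preferences (illustrative code following the commonly implemented formulation, not any specific library's API):

```python
import torch
import torch.nn.functional as F

def dpo_loss_from_margin(margin):
    """DPO: logistic loss keeps rewarding ever-larger margins."""
    return -F.logsigmoid(margin).mean()

def ipo_loss_from_margin(logratio_gap, tau=0.1):
    """IPO: squared error pulls the log-ratio gap toward 1/(2τ) and no further."""
    return ((logratio_gap - 1.0 / (2.0 * tau)) ** 2).mean()

gap = torch.tensor([5.0])       # log π-ratio gap between chosen and rejected
print(ipo_loss_from_margin(gap).item())      # 0.0 — exactly at the target
print(ipo_loss_from_margin(gap * 2).item())  # 25.0 — overshooting is penalized
```

The bounded target is what makes IPO more robust when the preference data violates the Bradley-Terry assumption.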
4. GRPO and DeepSeek-R1: Pure RL Eliciting Reasoning Capabilities
In early 2025, DeepSeek-AI released DeepSeek-R1[7], revealing a stunning discovery: without any human-annotated data, pure reinforcement learning can cause models to spontaneously develop reasoning capabilities. The core method is GRPO (Group Relative Policy Optimization)[11].
Core Principles of GRPO
GRPO was originally proposed in DeepSeekMath[11], designed to address two pain points of PPO: the training cost of the Critic (value model) and bias in the Reward Model.
Problems with PPO:
- Requires a Critic model to estimate the value of each token → extra memory and compute
- Reward Model may have biases → Reward Hacking
GRPO's Solution: Replace the Critic with intra-group relative ranking
GRPO Algorithm:
For each prompt x:
1. Sample a group of responses {y_1, y_2, ..., y_G} from policy π_θ
2. Score each response with a rule-based reward (or RM): {r_1, r_2, ..., r_G}
3. Compute normalized intra-group advantage:
A_i = (r_i - mean(r_1,...,r_G)) / std(r_1,...,r_G)
4. Update policy:
L_GRPO = -E[(1/G)·Σ_i min(ρ_i·A_i, clip(ρ_i, 1-ε, 1+ε)·A_i)]
+ β·KL[π_θ || π_ref]
where ρ_i = π_θ(y_i|x) / π_old(y_i|x)
GRPO vs PPO:
PPO: Requires Critic to estimate V(s) → A(s,a) = R - V(s)
GRPO: Uses group mean to replace V(s) → A_i = (r_i - mean) / std
→ No Critic model needed, saving ~50% memory
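The advantage step above is simple enough to sketch directly (a toy version; DeepSeek's production GRPO additionally handles per-token credit assignment and KL estimation):

```python
import torch

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: standardize rewards within one prompt's group."""
    r = torch.as_tensor(group_rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std(unbiased=False) + eps)

# G=4 sampled answers to one math prompt, rule-based reward: 1 if correct else 0
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get a positive advantage, incorrect a negative one
```

Because the advantage is computed from the group's own statistics, no learned value function is needed, which is exactly how GRPO drops the Critic.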
DeepSeek-R1-Zero: The "Aha Moment" of RL
DeepSeek-R1-Zero is the most exciting experiment: starting from a base model, with no SFT whatsoever, it was trained directly with GRPO + rule-based rewards. The rewards consisted of just two simple rules — correct response format and correct final answer.
Remarkably, the model spontaneously developed multiple reasoning behaviors during training:
- Self-verification: the model learned to check its own logic after answering
- Reflection: the model learned to go back and correct itself upon discovering errors
- Extended chain-of-thought: the model gradually learned to produce longer, more thorough reasoning chains
- "Aha moment": at a certain stage in training, the model suddenly began using reflective language such as "Wait, let me reconsider..."
None of these behaviors were taught by humans — they emerged naturally during the RL process of maximizing correctness. This suggests a profound possibility: reasoning capabilities may be a natural byproduct of RL training, rather than something that must be learned from human demonstrations.
GRPO vs PPO vs DPO: A Three-Way Comparison
| Feature | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Learning Signal | Reward Model | Preference pairs (offline) | Rule-based reward / RM |
| Requires Critic | Yes | No | No (intra-group relative) |
| Requires RM | Yes | No | Optional |
| Human Annotation | Extensive | Moderate | Can be entirely eliminated |
| Memory Efficiency | Low (4 models) | High (2 models) | Medium (2-3 models) |
| Reasoning Elicitation | Indirect | Limited | Strong (spontaneous emergence) |
| Applicable Scenarios | General alignment | Preference alignment | Reasoning, math, coding |
| Representative Systems | ChatGPT, Llama 2 | Zephyr, Mixtral | DeepSeek-R1 |
5. The Full Alignment Landscape: From KTO to Self-Rewarding
Beyond RLHF, DPO, and GRPO, the alignment landscape continues to expand rapidly. Below are several important directions.
KTO: Prospect Theory-Driven Alignment
The innovation of KTO (Kahneman-Tversky Optimization)[10] is that it does not require paired preference data — only binary labels indicating "this response is good" or "this response is bad." This dramatically lowers the barrier for data annotation.
DPO Data Format: (prompt, y_w, y_l) — requires paired comparisons under the same prompt
KTO Data Format: (prompt, y, label) — only needs a good/bad binary judgment
KTO Loss Function:
L_KTO = E_{y~desirable}[w(y)·(1 - σ(β·r_θ(x,y) - z_ref))]
+ E_{y~undesirable}[w(y)·(1 - σ(z_ref - β·r_θ(x,y)))]
r_θ(x,y) = log[π_θ(y|x) / π_ref(y|x)] (implicit reward)
z_ref: reference point (expected value of KL divergence)
w(y): weight function based on prospect theory
Key Insights from Prospect Theory:
- The pain of loss > the pleasure of equivalent gain (loss aversion)
- KTO automatically reweights: applies greater penalties to bad responses
- No paired data required → suitable for collecting feedback from product logs
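A toy rendering of the loss above, with a fixed reference point `z_ref` and illustrative loss-aversion weights (production implementations, e.g. TRL's `KTOTrainer`, estimate the reference point from a batch KL statistic):

```python
import torch

def kto_loss(logratios, labels, z_ref=0.0, beta=0.1, w_good=1.0, w_bad=1.5):
    """KTO loss per the formulas above.
    logratios: log π_θ(y|x)/π_ref(y|x) per example
    labels: 1 = desirable response, 0 = undesirable
    w_bad > w_good encodes loss aversion (bad responses hurt more)."""
    r = beta * logratios                         # implicit reward β·r_θ(x, y)
    v_good = 1 - torch.sigmoid(r - z_ref)        # push reward above the reference
    v_bad = 1 - torch.sigmoid(z_ref - r)         # push reward below the reference
    return torch.where(labels.bool(), w_good * v_good, w_bad * v_bad).mean()

# Unpaired feedback: each response carries only a good/bad flag
logratios = torch.tensor([2.0, -2.0, 2.0])
labels = torch.tensor([1, 0, 0])  # good, bad, bad
print(kto_loss(logratios, labels).item())
```

Note that each example contributes independently, which is why thumbs-up/thumbs-down logs from a live product can feed KTO directly, with no pairing step.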
Self-Rewarding Language Models
Self-Rewarding[12] proposes a radical idea: let the language model serve as its own reward model. The model simultaneously plays the roles of generator and judge, achieving alignment through iterative self-improvement.
Constitutional AI (Anthropic)
Anthropic's[6] Constitutional AI uses an explicit set of principles (the "constitution") to guide AI behavior. The AI first uses these principles to self-critique and revise its responses, then uses the revised data for RLHF. This reduces dependence on the subjective judgments of human annotators.
Full Landscape Comparison of Alignment Methods
| Method | Year | Data Requirements | Training Complexity | Core Innovation |
|---|---|---|---|---|
| RLHF (PPO) | 2022 | Preference pairs + SFT data | Very high | Reward model + PPO optimization |
| DPO | 2023 | Preference pairs | Low | Implicit reward, no RM needed |
| IPO | 2024 | Preference pairs | Low | No reliance on BT model assumptions |
| KTO | 2024 | Binary labels (unpaired) | Low | Prospect theory, no pairing needed |
| GRPO | 2024 | Rule-based rewards suffice | Medium | Intra-group relative advantage, no Critic |
| Self-Rewarding | 2024 | Initial seed data | Medium | Model self-evaluation and iterative improvement |
| Constitutional AI | 2022 | Principle set + minimal human feedback | High | Principle-guided self-correction |
6. Hands-on Lab 1: DPO Fine-Tuning with TRL (Google Colab)
The following experiment uses HuggingFace's TRL library to implement DPO fine-tuning on GPT-2 small. This experiment can be run entirely on Colab's free GPU (T4), allowing you to experience the core principles of alignment techniques firsthand.
# ============================================================
# Lab 1: DPO Fine-Tuning in Practice — Aligning GPT-2 with TRL
# Environment: Google Colab (T4 GPU), approx. 15-20 minutes
# ============================================================
# --- 1. Install required packages ---
!pip install -q "trl>=0.7.0" "transformers>=4.36.0" datasets peft accelerate bitsandbytes
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import (
AutoTokenizer,
AutoModelForCausalLM,
TrainingArguments,
)
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
import warnings
warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
# --- 2. Build preference dataset ---
# Simulating a real scenario: for the same prompt, there is a chosen (good response) and rejected (bad response)
preference_data = [
{
"prompt": "What is machine learning?",
"chosen": "Machine learning is a branch of artificial intelligence that enables computers to learn patterns from data and make predictions without being explicitly programmed. It uses algorithms to build models from training data.",
"rejected": "Machine learning is when computers do stuff with data. It's like, you know, AI things. Computers are smart now I guess.",
},
{
"prompt": "Explain what a neural network is.",
"chosen": "A neural network is a computational model inspired by the human brain. It consists of layers of interconnected nodes (neurons) that process information. Each connection has a weight that is adjusted during training to minimize prediction errors.",
"rejected": "Neural networks are complicated math things that nobody really understands. They just work somehow and that's all you need to know about them.",
},
{
"prompt": "What is the difference between supervised and unsupervised learning?",
"chosen": "Supervised learning uses labeled training data where each example has a known output, allowing the model to learn input-output mappings. Unsupervised learning works with unlabeled data, discovering hidden patterns and structures such as clusters or associations.",
"rejected": "Supervised is when someone watches the computer learn and unsupervised is when nobody watches. That's basically the whole difference between them.",
},
{
"prompt": "How does gradient descent work?",
"chosen": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters to minimize a loss function. It computes the gradient (partial derivatives) of the loss with respect to each parameter, then updates parameters in the opposite direction of the gradient, scaled by a learning rate.",
"rejected": "Gradient descent goes downhill. You just keep going down until you can't go down anymore. It's not that complicated really.",
},
{
"prompt": "What is overfitting in machine learning?",
"chosen": "Overfitting occurs when a model learns the training data too well, including its noise and outliers, resulting in poor generalization to new data. Signs include high training accuracy but low test accuracy. Common remedies include regularization, dropout, cross-validation, and using more training data.",
"rejected": "Overfitting is bad. It means your model memorized everything. Just add more data and it'll be fine probably.",
},
{
"prompt": "Explain the concept of regularization.",
"chosen": "Regularization is a set of techniques that prevent overfitting by adding constraints to the model. L1 regularization (Lasso) adds the absolute value of weights to the loss, promoting sparsity. L2 regularization (Ridge) adds the squared weights, encouraging smaller weight values. Both help the model generalize better.",
"rejected": "Regularization is a fancy word for making models work better. You add some penalty thing to the loss function and hope for the best.",
},
{
"prompt": "What is transfer learning?",
"chosen": "Transfer learning is a technique where a model pre-trained on a large dataset is adapted for a different but related task. Instead of training from scratch, the pre-trained model's learned representations are fine-tuned on the target task with less data. This significantly reduces training time and data requirements.",
"rejected": "Transfer learning means you take someone else's model and use it. It saves time because you don't have to train anything yourself.",
},
{
"prompt": "How does a convolutional neural network work?",
"chosen": "A convolutional neural network (CNN) processes data through convolutional layers that apply learnable filters to detect local features like edges and textures. Pooling layers reduce spatial dimensions. Deeper layers combine low-level features into high-level semantic representations. CNNs are particularly effective for image and spatial data.",
"rejected": "CNNs slide filters over images to find patterns. They work well for pictures and stuff like that.",
},
{
"prompt": "What is natural language processing?",
"chosen": "Natural language processing (NLP) is a field of AI focused on enabling computers to understand, interpret, and generate human language. Key tasks include text classification, named entity recognition, machine translation, sentiment analysis, and question answering. Modern NLP leverages transformer-based models like BERT and GPT.",
"rejected": "NLP is about making computers understand words. It's pretty useful for things like chatbots and translation apps.",
},
{
"prompt": "Explain the attention mechanism in transformers.",
"chosen": "The attention mechanism allows a model to dynamically focus on different parts of the input sequence when producing each output element. In self-attention, Query, Key, and Value vectors are computed from each token. Attention scores are calculated as the scaled dot product of Queries and Keys, then used to create weighted sums of Values, capturing contextual relationships.",
"rejected": "Attention is what makes transformers work. Each word looks at other words to figure out what's important. It's the key innovation in modern AI.",
},
{
"prompt": "What is reinforcement learning?",
"chosen": "Reinforcement learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions in states, receives rewards or penalties, and learns a policy that maximizes cumulative reward over time. Key concepts include the value function, policy gradient, and exploration-exploitation trade-off.",
"rejected": "Reinforcement learning is like training a dog with treats. Do something good, get a reward. Do something bad, no reward. Simple.",
},
{
"prompt": "How do you handle imbalanced datasets?",
"chosen": "Imbalanced datasets can be addressed through multiple strategies: oversampling the minority class (SMOTE), undersampling the majority class, using class weights in the loss function, ensemble methods like balanced random forests, anomaly detection approaches, or evaluation metrics insensitive to class distribution such as F1-score, precision-recall AUC, and Matthews correlation coefficient.",
"rejected": "Just duplicate the smaller class until both classes are the same size. That usually works fine for most problems.",
},
]
# Expand the dataset — increase data volume through paraphrasing
expanded_data = []
for item in preference_data:
expanded_data.append(item)
# Add slight variants to increase data diversity
expanded_data.append({
"prompt": "Please explain: " + item["prompt"].lower().rstrip("?.") + ".",
"chosen": item["chosen"],
"rejected": item["rejected"],
})
expanded_data.append({
"prompt": "Could you tell me: " + item["prompt"].lower(),
"chosen": item["chosen"],
"rejected": item["rejected"],
})
print(f"Total preference pairs: {len(expanded_data)}")
# Convert to HuggingFace Dataset
dataset = Dataset.from_list(expanded_data)
dataset = dataset.train_test_split(test_size=0.15, seed=42)
print(f"Train: {len(dataset['train'])}, Test: {len(dataset['test'])}")
# --- 3. Load model and tokenizer ---
model_name = "gpt2"
print(f"\nLoading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
# Load policy model
model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.pad_token_id = tokenizer.pad_token_id
# Load reference model (DPO requires a frozen reference model for KL divergence computation)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model.config.pad_token_id = tokenizer.pad_token_id
print(f"Model parameters: {model.num_parameters() / 1e6:.1f}M")
# --- 4. Pre-training response quality test ---
def generate_response(model, prompt, max_new_tokens=100):
model.eval()
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.7,
top_p=0.9,
pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
return response.strip()
test_prompts = [
"What is machine learning?",
"Explain the attention mechanism in transformers.",
"What is reinforcement learning?",
]
print("\n" + "=" * 60)
print("BEFORE DPO Training")
print("=" * 60)
pre_dpo_responses = {}
for prompt in test_prompts:
response = generate_response(model, prompt)
pre_dpo_responses[prompt] = response
print(f"\nPrompt: {prompt}")
print(f"Response: {response[:200]}...")
# --- 5. Configure DPO training ---
dpo_config = DPOConfig(
output_dir="./dpo_output",
num_train_epochs=3,
per_device_train_batch_size=2,
per_device_eval_batch_size=2,
gradient_accumulation_steps=4,
learning_rate=5e-5,
beta=0.1, # KL penalty coefficient — the most important hyperparameter in DPO
max_length=512,
max_prompt_length=128,
logging_steps=5,
eval_strategy="steps",
eval_steps=20,
save_strategy="no",
remove_unused_columns=False,
    fp16=torch.cuda.is_available() and not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),  # T4 GPUs do not support bf16
report_to="none",
)
# --- 6. Initialize DPO Trainer and train ---
print("\nInitializing DPO Trainer...")
trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=dpo_config,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer,
)
print("Starting DPO training...")
train_result = trainer.train()
print(f"\nTraining complete! Total steps: {train_result.global_step}")
# --- 7. Post-training response quality test ---
print("\n" + "=" * 60)
print("AFTER DPO Training")
print("=" * 60)
post_dpo_responses = {}
for prompt in test_prompts:
response = generate_response(model, prompt)
post_dpo_responses[prompt] = response
print(f"\nPrompt: {prompt}")
print(f"Response: {response[:200]}...")
# --- 8. Visualize training process ---
train_logs = [log for log in trainer.state.log_history if "loss" in log]
eval_logs = [log for log in trainer.state.log_history if "eval_loss" in log]
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Training loss
if train_logs:
steps = [log["step"] for log in train_logs]
losses = [log["loss"] for log in train_logs]
axes[0].plot(steps, losses, color="#0077b6", linewidth=2, label="Train Loss")
axes[0].set_xlabel("Step", fontsize=12)
axes[0].set_ylabel("DPO Loss", fontsize=12)
axes[0].set_title("DPO Training Loss", fontsize=14)
axes[0].grid(True, alpha=0.3)
axes[0].legend()
# Evaluation loss
if eval_logs:
eval_steps = [log["step"] for log in eval_logs]
eval_losses = [log["eval_loss"] for log in eval_logs]
axes[1].plot(eval_steps, eval_losses, "o-", color="#b8922e", linewidth=2, label="Eval Loss")
axes[1].set_xlabel("Step", fontsize=12)
axes[1].set_ylabel("Eval Loss", fontsize=12)
axes[1].set_title("DPO Evaluation Loss", fontsize=14)
axes[1].grid(True, alpha=0.3)
axes[1].legend()
# Reward margins (if recorded in logs)
reward_logs = [log for log in trainer.state.log_history if "rewards/margins" in log]
if reward_logs:
r_steps = [log["step"] for log in reward_logs]
margins = [log["rewards/margins"] for log in reward_logs]
axes[2].plot(r_steps, margins, "s-", color="#e63946", linewidth=2, label="Reward Margin")
axes[2].set_xlabel("Step", fontsize=12)
axes[2].set_ylabel("Margin (chosen - rejected)", fontsize=12)
axes[2].set_title("Reward Margins", fontsize=14)
axes[2].axhline(y=0, color="gray", linestyle="--", alpha=0.5)
axes[2].grid(True, alpha=0.3)
axes[2].legend()
else:
axes[2].text(0.5, 0.5, "Reward margins\nnot logged",
ha="center", va="center", fontsize=14, color="gray",
transform=axes[2].transAxes)
axes[2].set_title("Reward Margins", fontsize=14)
plt.tight_layout()
plt.savefig("dpo_training_results.png", dpi=150, bbox_inches="tight")
plt.show()
# --- 9. Implicit reward analysis ---
# Core insight of DPO: the policy itself is an implicit reward model
# r(x,y) = β * log(π_θ(y|x) / π_ref(y|x))
print("\n" + "=" * 60)
print("Implicit Reward Analysis")
print("=" * 60)
def compute_implicit_reward(model, ref_model, tokenizer, prompt, response, beta=0.1):
"""Compute the implicit reward score of DPO"""
model.eval()
ref_model.eval()
full_text = prompt + " " + response
inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
prompt_len = prompt_ids.shape[1]
with torch.no_grad():
logits = model(**inputs).logits
ref_logits = ref_model(**inputs).logits
# Compute log probabilities for the response portion
response_logits = logits[:, prompt_len - 1:-1, :]
ref_response_logits = ref_logits[:, prompt_len - 1:-1, :]
response_ids = inputs["input_ids"][:, prompt_len:]
log_probs = torch.log_softmax(response_logits, dim=-1)
ref_log_probs = torch.log_softmax(ref_response_logits, dim=-1)
token_log_probs = log_probs.gather(2, response_ids.unsqueeze(-1)).squeeze(-1)
ref_token_log_probs = ref_log_probs.gather(2, response_ids.unsqueeze(-1)).squeeze(-1)
# Implicit reward = β * Σ (log π_θ - log π_ref)
implicit_reward = beta * (token_log_probs.sum() - ref_token_log_probs.sum()).item()
return implicit_reward
ref_model = ref_model.to(device)
model = model.to(device)
sample_prompt = "What is machine learning?"
good_response = "Machine learning is a branch of AI that enables computers to learn from data and make predictions without explicit programming."
bad_response = "Machine learning is when computers do stuff. It's like AI things I guess."
reward_good = compute_implicit_reward(model, ref_model, tokenizer, sample_prompt, good_response)
reward_bad = compute_implicit_reward(model, ref_model, tokenizer, sample_prompt, bad_response)
print(f"Prompt: {sample_prompt}")
print(f"Good response reward: {reward_good:.4f}")
print(f"Bad response reward: {reward_bad:.4f}")
print(f"Margin (good - bad): {reward_good - reward_bad:.4f}")
print(f"P(good ≻ bad) = σ(margin) = {torch.sigmoid(torch.tensor(reward_good - reward_bad)).item():.4f}")
print("\nLab 1 Complete!")
7. Hands-on Lab 2: Reward Model Training and Evaluation (Google Colab)
The following experiment uses TRL's RewardTrainer to train a simple reward model and evaluate its ranking accuracy. The reward model is the core component of RLHF — it quantifies subjective human preferences into optimizable scalar scores.
# ============================================================
# Lab 2: Reward Model Training and Evaluation
# Environment: Google Colab (T4 GPU), approx. 10-15 minutes
# ============================================================
# --- 1. Install required packages ---
!pip install -q "trl>=0.7.0" "transformers>=4.36.0" datasets accelerate
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from transformers import (
AutoTokenizer,
AutoModelForSequenceClassification,
TrainingArguments,
)
from trl import RewardConfig, RewardTrainer
from datasets import Dataset
from sklearn.metrics import accuracy_score, roc_auc_score
import warnings
warnings.filterwarnings("ignore")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device: {device}")
# --- 2. Build preference dataset ---
# Each entry: (prompt, chosen, rejected)
# chosen is the annotator-preferred response, rejected is the dispreferred response
preference_data = [
# Accuracy preference
{
"prompt": "What is the capital of France?",
"chosen": "The capital of France is Paris. It is the largest city in France and serves as the country's political, economic, and cultural center.",
"rejected": "The capital of France is probably Lyon or Marseille. It's one of those big cities in France.",
},
{
"prompt": "How many planets are in our solar system?",
"chosen": "There are eight planets in our solar system: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Pluto was reclassified as a dwarf planet in 2006.",
"rejected": "There are nine planets including Pluto. Everyone knows that from school.",
},
# Helpfulness preference
{
"prompt": "How do I make a cup of coffee?",
"chosen": "To make a cup of coffee: 1) Boil fresh water to about 93-96 degrees Celsius. 2) Add 2 tablespoons of ground coffee per 180ml of water to your filter. 3) Pour hot water over the grounds slowly. 4) Let it brew for 3-4 minutes. 5) Serve and enjoy.",
"rejected": "Just put coffee in water and heat it up.",
},
{
"prompt": "What should I consider when buying a laptop?",
"chosen": "Key factors when buying a laptop include: processor speed and generation (Intel i5/i7 or AMD Ryzen 5/7), RAM (16GB minimum for productivity), storage type and size (SSD preferred, 512GB+), display quality and size, battery life, weight for portability, and your budget relative to intended use.",
"rejected": "Just get whatever is on sale. They're all pretty much the same these days.",
},
# Safety preference
{
"prompt": "How can I lose weight quickly?",
"chosen": "For healthy and sustainable weight loss, aim for 0.5-1kg per week through a balanced approach: create a modest calorie deficit (300-500 cal/day), increase physical activity, eat whole foods rich in protein and fiber, stay hydrated, and get adequate sleep. Consult a healthcare professional before starting any diet.",
"rejected": "Just stop eating for a few days and you'll lose weight fast. Fasting is the quickest way to drop pounds.",
},
{
"prompt": "I'm feeling really stressed about work.",
"chosen": "I understand work stress can be overwhelming. Some evidence-based strategies include: practicing deep breathing exercises, taking regular short breaks, setting clear boundaries between work and personal time, prioritizing tasks, talking to a trusted friend or counselor, and engaging in physical activity. If stress persists, consider speaking with a mental health professional.",
"rejected": "Just quit your job if it stresses you out. Life's too short to deal with that.",
},
# Depth preference
{
"prompt": "Explain how vaccines work.",
"chosen": "Vaccines work by training the immune system to recognize and fight specific pathogens. They contain weakened or inactivated forms of a virus, or key proteins from it. When administered, the immune system produces antibodies and memory cells. If later exposed to the actual pathogen, the immune system can respond quickly and effectively, preventing or reducing illness severity.",
"rejected": "Vaccines put stuff in your body that makes you immune to diseases. They've been around for a long time.",
},
{
"prompt": "Why is the sky blue?",
"chosen": "The sky appears blue due to Rayleigh scattering. Sunlight contains all colors of the visible spectrum. As it enters Earth's atmosphere, shorter wavelengths (blue and violet) scatter more than longer wavelengths (red and orange) when they collide with gas molecules. Our eyes are more sensitive to blue than violet, so we perceive the sky as blue.",
"rejected": "The sky is blue because of the atmosphere. It just scatters light in a way that makes it look blue.",
},
# Format preference
{
"prompt": "List three benefits of exercise.",
"chosen": "Three key benefits of regular exercise are: 1) Improved cardiovascular health, reducing the risk of heart disease and stroke. 2) Better mental health, as exercise releases endorphins that reduce stress, anxiety, and depression. 3) Enhanced cognitive function, including improved memory, focus, and reduced risk of neurodegenerative diseases.",
"rejected": "Exercise is good for your heart and mind and body. It helps you in many ways and you should do it regularly because doctors recommend it.",
},
{
"prompt": "What is photosynthesis?",
"chosen": "Photosynthesis is the biological process by which green plants, algae, and some bacteria convert light energy into chemical energy. Using chlorophyll in chloroplasts, they absorb sunlight and use it to transform carbon dioxide and water into glucose and oxygen. The simplified equation is: 6CO2 + 6H2O + light energy -> C6H12O6 + 6O2.",
"rejected": "Photosynthesis is how plants make food from sunlight. They use their leaves to capture energy and turn it into sugar.",
},
{
"prompt": "How does encryption work?",
"chosen": "Encryption converts readable data (plaintext) into unreadable form (ciphertext) using mathematical algorithms and keys. Symmetric encryption uses the same key for encryption and decryption (e.g., AES). Asymmetric encryption uses a public key to encrypt and a private key to decrypt (e.g., RSA). Modern encryption ensures data confidentiality even if intercepted.",
"rejected": "Encryption scrambles your data so hackers can't read it. It's like a secret code.",
},
{
"prompt": "What causes seasons on Earth?",
"chosen": "Seasons are caused by Earth's axial tilt of approximately 23.5 degrees relative to its orbital plane around the Sun. This tilt means different hemispheres receive varying amounts of direct sunlight throughout the year. When the Northern Hemisphere tilts toward the Sun, it experiences summer while the Southern Hemisphere has winter, and vice versa.",
"rejected": "Seasons happen because the Earth moves closer and farther from the Sun during the year.",
},
]
# Expand the dataset
expanded = []
for item in preference_data:
expanded.append(item)
expanded.append({
"prompt": "Q: " + item["prompt"],
"chosen": item["chosen"],
"rejected": item["rejected"],
})
expanded.append({
"prompt": "Answer this question: " + item["prompt"],
"chosen": item["chosen"],
"rejected": item["rejected"],
})
print(f"Total preference pairs: {len(expanded)}")
dataset = Dataset.from_list(expanded)
split = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset = split["train"]
test_dataset = split["test"]
print(f"Train: {len(train_dataset)}, Test: {len(test_dataset)}")
# --- 3. Load model ---
model_name = "distilbert-base-uncased"
print(f"\nLoading reward model backbone: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=1, # Reward model outputs a single scalar score
)
print(f"Model parameters: {model.num_parameters() / 1e6:.1f}M")
# --- 4. Data preprocessing ---
def preprocess_function(examples):
"""Convert preference pairs to Reward Model training format"""
chosen_texts = [
p + " [SEP] " + c
for p, c in zip(examples["prompt"], examples["chosen"])
]
rejected_texts = [
p + " [SEP] " + r
for p, r in zip(examples["prompt"], examples["rejected"])
]
chosen_encodings = tokenizer(
chosen_texts, truncation=True, padding="max_length", max_length=256
)
rejected_encodings = tokenizer(
rejected_texts, truncation=True, padding="max_length", max_length=256
)
return {
"input_ids_chosen": chosen_encodings["input_ids"],
"attention_mask_chosen": chosen_encodings["attention_mask"],
"input_ids_rejected": rejected_encodings["input_ids"],
"attention_mask_rejected": rejected_encodings["attention_mask"],
}
train_processed = train_dataset.map(preprocess_function, batched=True, remove_columns=train_dataset.column_names)
test_processed = test_dataset.map(preprocess_function, batched=True, remove_columns=test_dataset.column_names)
# --- 5. Train Reward Model ---
training_args = RewardConfig(
output_dir="./reward_model_output",
num_train_epochs=5,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=2,
learning_rate=2e-5,
weight_decay=0.01,
logging_steps=5,
eval_strategy="steps",
eval_steps=10,
save_strategy="no",
max_length=256,
remove_unused_columns=False,
report_to="none",
)
trainer = RewardTrainer(
model=model,
args=training_args,
train_dataset=train_processed,
eval_dataset=test_processed,
processing_class=tokenizer,
)
print("\nStarting Reward Model training...")
train_result = trainer.train()
print(f"Training complete! Steps: {train_result.global_step}")
# --- 6. Evaluate Reward Model ---
print("\n" + "=" * 60)
print("Reward Model Evaluation")
print("=" * 60)
def get_reward_score(model, tokenizer, prompt, response):
"""Get the reward model's score for a (prompt, response) pair"""
model.eval()
text = prompt + " [SEP] " + response
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
return outputs.logits.item()
# Compute ranking accuracy on the original test data
correct = 0
total = 0
chosen_rewards = []
rejected_rewards = []
for item in preference_data:
r_chosen = get_reward_score(model, tokenizer, item["prompt"], item["chosen"])
r_rejected = get_reward_score(model, tokenizer, item["prompt"], item["rejected"])
chosen_rewards.append(r_chosen)
rejected_rewards.append(r_rejected)
if r_chosen > r_rejected:
correct += 1
total += 1
accuracy = correct / total
print(f"Ranking Accuracy: {accuracy:.1%} ({correct}/{total})")
print(f"Average Chosen Reward: {np.mean(chosen_rewards):.4f}")
print(f"Average Rejected Reward: {np.mean(rejected_rewards):.4f}")
print(f"Average Margin: {np.mean(np.array(chosen_rewards) - np.array(rejected_rewards)):.4f}")
# --- 7. Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Reward distribution
axes[0].hist(chosen_rewards, bins=10, alpha=0.7, color="#0077b6", label="Chosen", edgecolor="white")
axes[0].hist(rejected_rewards, bins=10, alpha=0.7, color="#e63946", label="Rejected", edgecolor="white")
axes[0].set_xlabel("Reward Score", fontsize=12)
axes[0].set_ylabel("Count", fontsize=12)
axes[0].set_title("Reward Score Distribution", fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
# Per-sample comparison
x_pos = np.arange(len(preference_data))
width = 0.35
axes[1].bar(x_pos - width / 2, chosen_rewards, width, color="#0077b6", label="Chosen", alpha=0.8)
axes[1].bar(x_pos + width / 2, rejected_rewards, width, color="#e63946", label="Rejected", alpha=0.8)
axes[1].set_xlabel("Sample Index", fontsize=12)
axes[1].set_ylabel("Reward Score", fontsize=12)
axes[1].set_title("Per-Sample Reward Comparison", fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3, axis="y")
# Training loss curve
train_logs = [log for log in trainer.state.log_history if "loss" in log]
if train_logs:
steps = [log["step"] for log in train_logs]
losses = [log["loss"] for log in train_logs]
axes[2].plot(steps, losses, color="#b8922e", linewidth=2)
axes[2].set_xlabel("Step", fontsize=12)
axes[2].set_ylabel("Loss", fontsize=12)
axes[2].set_title("Reward Model Training Loss", fontsize=14)
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("reward_model_results.png", dpi=150, bbox_inches="tight")
plt.show()
# --- 8. Interactive test: Reward scoring for custom prompts ---
print("\n" + "=" * 60)
print("Interactive Reward Scoring")
print("=" * 60)
test_cases = [
{
"prompt": "What is deep learning?",
"responses": [
"Deep learning is a subset of machine learning that uses artificial neural networks with multiple layers (hence 'deep') to model and understand complex patterns in data. It excels at tasks like image recognition, natural language processing, and game playing by automatically learning hierarchical representations.",
"Deep learning is basically AI that uses lots of layers. It's really powerful and can do many things.",
"Deep learning is a type of machine learning. It uses neural networks to learn from data and make predictions about stuff.",
],
},
{
"prompt": "Is it safe to eat raw chicken?",
"responses": [
"No, eating raw chicken is not safe. Raw chicken frequently contains harmful bacteria such as Salmonella and Campylobacter, which can cause serious foodborne illness. Always cook chicken to an internal temperature of at least 74 degrees Celsius (165 degrees Fahrenheit) to ensure these pathogens are eliminated.",
"Sure, some people eat raw chicken in certain cuisines. It should be fine if the chicken is fresh.",
"I wouldn't recommend it but it probably won't kill you. Just make sure it smells okay.",
],
},
]
for case in test_cases:
print(f"\nPrompt: {case['prompt']}")
scores = []
for i, resp in enumerate(case["responses"]):
score = get_reward_score(model, tokenizer, case["prompt"], resp)
scores.append(score)
print(f" Response {i + 1} (reward={score:.4f}): {resp[:80]}...")
# Ranking
ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
print(f" Ranking: {' > '.join([f'R{idx + 1}({s:.3f})' for idx, s in ranked])}")
# --- 9. Bradley-Terry preference probabilities ---
print("\n" + "=" * 60)
print("Bradley-Terry Preference Probabilities")
print("=" * 60)
def bradley_terry_prob(r1, r2):
"""P(response1 > response2) = sigmoid(r1 - r2)"""
return torch.sigmoid(torch.tensor(r1 - r2)).item()
for case in test_cases:
print(f"\nPrompt: {case['prompt'][:50]}...")
scores = [get_reward_score(model, tokenizer, case["prompt"], r) for r in case["responses"]]
for i in range(len(scores)):
for j in range(i + 1, len(scores)):
prob = bradley_terry_prob(scores[i], scores[j])
print(f" P(R{i+1} ≻ R{j+1}) = {prob:.4f}")
print("\nLab 2 Complete!")
8. Decision Framework: How Enterprises Should Choose an Alignment Strategy
Facing multiple alignment techniques, enterprises need to make pragmatic choices based on their resources, objectives, and constraints. Below is a systematic decision framework.
Decision Dimension 1: Data Availability
| Data Scenario | Recommended Method | Rationale |
|---|---|---|
| Large volume of paired preference data (>10K pairs) | RLHF or DPO | Both perform similarly with sufficient data; DPO is more economical |
| Moderate preference data (1K-10K pairs) | DPO | DPO is more stable with moderate data volumes |
| Only binary labels (good/bad) | KTO | No pairing required; can be collected from product logs |
| Verifiable correct answers available | GRPO | Rule-based rewards eliminate the need for human annotation |
| Almost no data | Constitutional AI / Self-Rewarding | Uses principles or model self-evaluation to replace human annotation |
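The table above can be read as a simple decision procedure. The sketch below encodes it as an illustrative lookup; the function name, parameters, and thresholds are this article's own framing, not from any library:

```python
def recommend_alignment_method(num_preference_pairs: int = 0,
                               has_binary_labels: bool = False,
                               has_verifiable_answers: bool = False) -> str:
    """Map the data-availability scenarios in the table to a recommended method."""
    if has_verifiable_answers:
        # Rule-based rewards (e.g., exact-match on math answers) need no annotators
        return "GRPO"
    if num_preference_pairs > 10_000:
        return "RLHF or DPO"   # comparable quality at this scale; DPO is cheaper
    if num_preference_pairs >= 1_000:
        return "DPO"           # more stable at moderate data volumes
    if has_binary_labels:
        return "KTO"           # unpaired good/bad labels from product logs suffice
    return "Constitutional AI / Self-Rewarding"

print(recommend_alignment_method(num_preference_pairs=5_000))  # → DPO
```

In practice the dimensions interact (budget and objectives below can override data availability), so treat this as a first-pass filter rather than a final answer.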
Decision Dimension 2: Budget and Technical Capability
| Resource Level | Recommended Method | Estimated Cost |
|---|---|---|
| High-budget team (GPU cluster + ML experts) | RLHF (PPO) | High compute + annotation costs |
| Medium budget (single-node multi-GPU + engineers) | DPO or GRPO | Moderate compute, low annotation costs |
| Low budget (single GPU + developers) | KTO or DPO + LoRA | Minimal compute, minimal annotation costs |
Decision Dimension 3: Application Objectives
| Objective | Recommended Method | Explanation |
|---|---|---|
| General-purpose chat assistant | RLHF or DPO | Requires balancing helpfulness and safety |
| Math/code reasoning | GRPO | Correctness can serve as a rule-based reward |
| Domain-specific assistant | DPO + domain preference data | Cost-effective with stable results |
| Safety alignment | Constitutional AI + RLHF | Principle-guided + human oversight |
| Continuous improvement | Self-Rewarding + iterative DPO | Automated iterative optimization |
Cost-Benefit Analysis
ROI of alignment methods (rough estimates):

| Method | Initial Investment | Maintenance | Alignment Quality | Applicable Scale |
|---|---|---|---|---|
| RLHF (PPO) | $$$$$ | $$$ | ★★★★★ | 10B+ models |
| DPO | $$ | $ | ★★★★ | 1B-70B models |
| KTO | $ | $ | ★★★ | 1B-13B models |
| GRPO | $$$ | $$ | ★★★★★ | Reasoning tasks |
| Self-Rewarding | $$ | $ | ★★★ | Research stage |
Typical ROI Scenarios:
- Startups: DPO + LoRA fine-tuning a 7B model → highest cost-effectiveness
- Mid-size enterprises: DPO fine-tuning 13B-70B models → balancing quality and cost
- Large tech companies: Full RLHF pipeline → highest quality
- Research teams: GRPO for exploring reasoning capabilities → frontier breakthroughs
9. Conclusion and Outlook
From Christiano et al.[2] proposing "learning from human preferences" in robotic control in 2017, to InstructGPT[1] systematically applying RLHF to language models in 2022, to DeepSeek-R1[7] using pure RL to elicit reasoning capabilities in 2025 — alignment techniques have undergone a revolutionary evolution in just a few years.
Several trends worth watching:
- From human annotation to automation: RLHF requires extensive human annotation; DPO reduced engineering complexity; GRPO uses rule-based rewards to replace human judgment; Self-Rewarding[12] lets models judge themselves. The human labor cost of alignment is declining sharply.
- From alignment to capability elicitation: Early RLHF was primarily used to "prevent models from saying harmful things"; GRPO demonstrated that RL can elicit entirely new capabilities (such as reasoning). Alignment techniques are evolving from "safety guardrails" into "capability amplifiers."
- From offline to online: DPO uses offline preference data; Online DPO and GRPO generate data in real time during training. Online methods generally explore the policy space more effectively.
- The challenge of multi-objective alignment: Helpfulness, safety, honesty, fairness — tensions exist among these objectives. Future alignment techniques will need to more delicately balance multiple dimensions of human values[6].
- Scalable oversight: As model capabilities surpass those of human annotators, providing effective supervisory signals becomes a core challenge. Constitutional AI and Self-Rewarding represent early explorations in this direction.
Alignment is not merely a technical problem — it is a philosophical one: what "human values" do we ultimately want AI to align with? Whose values? How do we strike a balance across different cultures and communities? The answers to these questions will profoundly shape the future trajectory of AI.
For practitioners, this article recommends: start with DPO. It is currently the most streamlined and cost-effective alignment method from an engineering perspective. As your model scale and quality requirements grow, consider a full RLHF pipeline or explore GRPO for reasoning capability elicitation. The evolution of alignment techniques teaches us that the best method is often the simplest — as long as the math is right.



