Paper: ORPO: Monolithic Preference Optimization without Reference Model (arXiv:2403.07691)
A LoRA adapter trained with ORPO (Odds Ratio Preference Optimization) on 20,751 real user preference pairs collected from production chatbot services.
Base Model: llm-model-lab/thinking-v2-3epoch
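The adapter was trained with TRL's ORPOTrainer (see the citations below). The following is a minimal reproduction sketch, not the exact training script: the dataset file, LoRA target modules, rank, and hyperparameters are illustrative assumptions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base = "llm-model-lab/thinking-v2-3epoch"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# ORPO expects preference pairs with "prompt", "chosen", and "rejected" columns.
# The file name here is hypothetical; the real dataset is not public.
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

peft_config = LoraConfig(
    r=128,           # assumption, chosen to match --max-lora-rank at serving time
    lora_alpha=256,  # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

args = ORPOConfig(
    output_dir="gemma3-27b-orpo-thinking-v2",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # assumption
    learning_rate=5e-6,             # assumption
    beta=0.1,                       # weight of the odds-ratio term (the paper's lambda)
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
    peft_config=peft_config,
)
trainer.train()
```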
The base model emits its reasoning inside `<think>` tokens.

Training metrics:

| Metric | Initial | Final (Epoch 3) |
|---|---|---|
| Loss | 2.083 | 1.164 |
| NLL Loss | 2.012 | 1.096 |
| Rewards Accuracy | 38.9% | 59.3% |
| Eval Loss | 2.054 | 1.164 |
No overfitting was detected; the eval loss decreased consistently across all three epochs.
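For context on these metrics, here is a rough sketch of how the ORPO objective decomposes; it mirrors the formulation in the paper and TRL rather than the exact training code. The total Loss is the NLL (SFT) term plus a weighted odds-ratio term, and Rewards Accuracy is the fraction of pairs where the chosen response outscores the rejected one.

```python
import torch
import torch.nn.functional as F

def orpo_objective(chosen_logps, rejected_logps, beta=0.1):
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected responses under the policy, shape [batch].
    # Log-odds log(p / (1 - p)), computed in log space.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio loss: push the chosen response's odds above the rejected one's.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # Standard NLL (SFT) loss on the chosen responses.
    nll_loss = -chosen_logps.mean()
    # Fraction of pairs where the chosen response is scored higher.
    rewards_accuracy = (chosen_logps > rejected_logps).float().mean()
    return nll_loss + beta * or_loss, nll_loss, rewards_accuracy
```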
Serve the adapter with vLLM:

```bash
vllm serve llm-model-lab/thinking-v2-3epoch \
  --enable-lora \
  --lora-modules orpo=llm-model-lab/gemma3-27b-orpo-thinking-v2 \
  --max-lora-rank 128 \
  --dtype bfloat16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
```
Query the served adapter through the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="orpo",  # matches the --lora-modules key
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing simply."},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
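Depending on the chat template and server configuration, the reasoning may come back inline in `message.content` wrapped in `<think> ... </think>` tags (the exact tag format is an assumption here). If so, a small post-processing step can separate the reasoning from the final answer:

```python
import re

text = response.choices[0].message.content
# Assumes reasoning is wrapped in <think>...</think>; adjust if the template differs.
match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
print(answer)
```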
Use the adapter directly with Transformers and PEFT:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "llm-model-lab/thinking-v2-3epoch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)

# Attach the ORPO LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "llm-model-lab/gemma3-27b-orpo-thinking-v2",
)
tokenizer = AutoTokenizer.from_pretrained("llm-model-lab/gemma3-27b-orpo-thinking-v2")

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
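To ship a single standalone checkpoint instead of base model plus adapter, the LoRA weights can be merged in with PEFT's `merge_and_unload` (the output directory name below is only an example):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("gemma3-27b-orpo-thinking-v2-merged")
tokenizer.save_pretrained("gemma3-27b-orpo-thinking-v2-merged")
```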
This model was trained with ORPO, a method introduced in ORPO: Monolithic Preference Optimization without Reference Model.
Cite ORPO as:
```bibtex
@article{hong2024orpo,
  title  = {{ORPO: Monolithic Preference Optimization without Reference Model}},
  author = {Jiwoo Hong and Noah Lee and James Thorne},
  year   = 2024,
  eprint = {arXiv:2403.07691}
}
```
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```