Paper: ORPO: Monolithic Preference Optimization without Reference Model (arXiv:2403.07691)
A LoRA adapter trained with ORPO (Odds Ratio Preference Optimization) on 20,751 real user preference pairs collected from production chatbot services.
Base Model: llm-model-lab/thinking-v2-3epoch
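The adapter was trained with TRL's ORPOTrainer (see the citations below). The following is a minimal reproduction sketch, not the exact training script: the dataset file, LoRA target modules, rank, and hyperparameters are illustrative assumptions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base = "llm-model-lab/thinking-v2-3epoch"
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

# ORPO expects preference pairs with "prompt", "chosen", and "rejected" columns.
# The file name here is hypothetical; the real dataset is not public.
dataset = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

peft_config = LoraConfig(
    r=128,           # assumption, chosen to match --max-lora-rank at serving time
    lora_alpha=256,  # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

args = ORPOConfig(
    output_dir="gemma3-27b-orpo-thinking-v2",
    num_train_epochs=3,
    per_device_train_batch_size=1,  # assumption
    learning_rate=5e-6,             # assumption
    beta=0.1,                       # weight of the odds-ratio term (the paper's lambda)
    bf16=True,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` on older TRL releases
    peft_config=peft_config,
)
trainer.train()
```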
The base model emits its reasoning inside `<think>` tokens.

Training metrics:

| Metric | Initial | Final (Epoch 3) |
|---|---|---|
| Loss | 2.083 | 1.164 |
| NLL Loss | 2.012 | 1.096 |
| Rewards Accuracy | 38.9% | 59.3% |
| Eval Loss | 2.054 | 1.164 |
No overfitting was detected; the eval loss decreased consistently across all three epochs.
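For context on these metrics, here is a rough sketch of how the ORPO objective decomposes; it mirrors the formulation in the paper and TRL rather than the exact training code. The total Loss is the NLL (SFT) term plus a weighted odds-ratio term, and Rewards Accuracy is the fraction of pairs where the chosen response outscores the rejected one.

```python
import torch
import torch.nn.functional as F

def orpo_objective(chosen_logps, rejected_logps, beta=0.1):
    # chosen_logps / rejected_logps: average per-token log-probabilities of the
    # chosen and rejected responses under the policy, shape [batch].
    # Log-odds log(p / (1 - p)), computed in log space.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # Odds-ratio loss: push the chosen response's odds above the rejected one's.
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # Standard NLL (SFT) loss on the chosen responses.
    nll_loss = -chosen_logps.mean()
    # Fraction of pairs where the chosen response is scored higher.
    rewards_accuracy = (chosen_logps > rejected_logps).float().mean()
    return nll_loss + beta * or_loss, nll_loss, rewards_accuracy
```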
Serve the adapter with vLLM:

```bash
vllm serve llm-model-lab/thinking-v2-3epoch \
  --enable-lora \
  --lora-modules orpo=llm-model-lab/gemma3-27b-orpo-thinking-v2 \
  --max-lora-rank 128 \
  --dtype bfloat16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9
```
Query the served adapter through the OpenAI-compatible API:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="orpo",  # matches the --lora-modules key
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Explain quantum computing simply."},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    max_tokens=2048,
)
print(response.choices[0].message.content)
```
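Depending on the chat template and server configuration, the reasoning may come back inline in `message.content` wrapped in `<think> ... </think>` tags (the exact tag format is an assumption here). If so, a small post-processing step can separate the reasoning from the final answer:

```python
import re

text = response.choices[0].message.content
# Assumes reasoning is wrapped in <think>...</think>; adjust if the template differs.
match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
reasoning = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
print(answer)
```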
Use the adapter directly with Transformers and PEFT:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "llm-model-lab/thinking-v2-3epoch",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",
)

# Attach the ORPO LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "llm-model-lab/gemma3-27b-orpo-thinking-v2",
)
tokenizer = AutoTokenizer.from_pretrained("llm-model-lab/gemma3-27b-orpo-thinking-v2")

# Generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
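To ship a single standalone checkpoint instead of base model plus adapter, the LoRA weights can be merged in with PEFT's `merge_and_unload` (the output directory name below is only an example):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("gemma3-27b-orpo-thinking-v2-merged")
tokenizer.save_pretrained("gemma3-27b-orpo-thinking-v2-merged")
```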
This model was trained with ORPO, a method introduced in ORPO: Monolithic Preference Optimization without Reference Model.
Cite ORPO as:
```bibtex
@article{hong2024orpo,
  title  = {{ORPO: Monolithic Preference Optimization without Reference Model}},
  author = {Jiwoo Hong and Noah Lee and James Thorne},
  year   = 2024,
  eprint = {arXiv:2403.07691}
}
```
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```