# CodeReview-Qwen32B
A fine-tuned Qwen2.5-Coder-32B-Instruct model specialized for code review. Trained on 48,460 human-written code review comments from the top GitHub repositories.
This is a 4-bit QLoRA adapter trained with supervised fine-tuning (SFT); a follow-up DPO step was run but did not improve on the SFT checkpoint (see Limitations).
## Results
| Metric | Base Model (Zero-Shot) | SFT Fine-Tuned | Change |
|---|---|---|---|
| BLEU-4 | 3.82 | 16.91 | +343% |
| ROUGE-L F1 | 0.081 | 0.216 | +167% |
| Sentence-BERT Cosine Sim | 0.477 | 0.526 | +10% |
| Comment Type Accuracy | 0.00 | 0.640 | -- |
| Comment Type Coverage | 0.00 | 1.00 | -- |
The fine-tuned model produces review comments with roughly 4x the lexical overlap with real human reviews (BLEU-4), nearly 3x longer matching subsequences (ROUGE-L), and it tags its comments with the correct type (suggestion, issue, refactoring, etc.) 64% of the time.
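For reference, sentence-level BLEU-4 can be sketched in plain Python (modified n-gram precision with a brevity penalty; a minimal illustration, not the exact evaluation script used for the numbers above):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis):
    """Sentence BLEU-4: geometric mean of clipped 1..4-gram precisions
    times a brevity penalty. Returns 0.0 if any precision is zero."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, 5):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
        total = max(sum(hyp_ng.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec += 0.25 * math.log(overlap / total)
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)
```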
## Model Details
- Base model: Qwen/Qwen2.5-Coder-32B-Instruct
- Method: QLoRA (4-bit) SFT → DPO
- Training data: ronantakizawa/github-codereview (119K examples, 48K after filtering)
- Hardware: 1x NVIDIA H100 SXM 80GB
- Training time: ~6 hours SFT + ~1.7 hours DPO
- Framework: Unsloth + TRL 0.24.0 + Transformers 5.2.0
## Training

### Dataset
The training data comes from ronantakizawa/github-codereview, containing 119K real inline code review triplets (code, review comment, revised code) extracted from merged pull requests across 504 top GitHub repositories.
Filtering: Quality score >= 0.5, minimum 50 char comments, deduplicated by (repo, PR, line), capped at 1000 examples per repo. 25 repos held out entirely for OOD evaluation.
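The filtering rules above can be sketched roughly as follows (field names such as `quality_score`, `repo`, `pr`, and `line` are illustrative, not necessarily the dataset's exact schema):

```python
from collections import defaultdict

def filter_examples(examples, min_quality=0.5, min_comment_chars=50, per_repo_cap=1000):
    """Apply the quality, length, dedup, and per-repo cap filters."""
    seen = set()                  # dedup key: (repo, PR, line)
    per_repo = defaultdict(int)   # examples kept per repository
    kept = []
    for ex in examples:
        if ex["quality_score"] < min_quality:
            continue
        if len(ex["comment"]) < min_comment_chars:
            continue
        key = (ex["repo"], ex["pr"], ex["line"])
        if key in seen:
            continue
        if per_repo[ex["repo"]] >= per_repo_cap:
            continue
        seen.add(key)
        per_repo[ex["repo"]] += 1
        kept.append(ex)
    return kept
```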
### Multi-Task SFT
The model was trained on four tasks simultaneously:
| Task | Mix | Description |
|---|---|---|
| Comment Generation | 50% | Code + diff → [comment_type] review comment |
| Code Refinement | 20% | Code + comment → revised code |
| Structured Review | 15% | Multiple diffs → numbered multi-issue review |
| Negative Examples | 15% | Clean code → "no issues found" (reduces false positives) |
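The task mix can be realized with a simple weighted sampler (an illustrative sketch assuming per-task example pools; the actual training pipeline may interleave tasks differently):

```python
import random

# SFT task mixing weights from the table above.
TASK_MIX = {
    "comment_generation": 0.50,
    "code_refinement": 0.20,
    "structured_review": 0.15,
    "negative_examples": 0.15,
}

def sample_task(rng):
    """Draw one task name according to the mixing weights."""
    tasks, weights = zip(*TASK_MIX.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

# Sanity check: empirical frequencies approach the target mix.
rng = random.Random(0)
counts = {t: 0 for t in TASK_MIX}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
```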
### SFT Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0 |
| Target modules | q/k/v/o_proj, gate/up/down_proj |
| Learning rate | 2e-4 (cosine, 5% warmup) |
| Effective batch size | 16 (per_device=2, grad_accum=8) |
| Epochs | 2 |
| Max sequence length | 6,144 tokens |
| Precision | bf16 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth mode |
| Total steps | 3,029 |
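In `peft` terms, the adapter settings above correspond roughly to the following configuration (a sketch; the actual run used Unsloth's `get_peft_model` wrapper, and training arguments are omitted):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```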
## Usage
### With Unsloth (Recommended)

````python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ronantakizawa/codereview-qwen32b",
    max_seq_length=6144,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": (
        "You are an expert code reviewer for Python projects. "
        "Analyze the code and diff context, then provide a precise, "
        "actionable review comment identifying issues or suggesting improvements. "
        "If the code is well-written and has no issues, state that clearly."
    )},
    {"role": "user", "content": """**Pull Request:** Fix user authentication
**Repository:** example/app
**File:** auth/login.py
**Language:** Python

**Code under review (around line 42):**
```python
def authenticate(username, password):
    user = db.query(f"SELECT * FROM users WHERE name='{username}'")
    if user and user.password == password:
        return create_session(user)
    return None
```

**Diff context:**
```diff
+def authenticate(username, password):
+    user = db.query(f"SELECT * FROM users WHERE name='{username}'")
+    if user and user.password == password:
+        return create_session(user)
+    return None
```

Review this code change and provide your comment."""},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512, temperature=0.1)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
````
### With Transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base_model, "ronantakizawa/codereview-qwen32b")
tokenizer = AutoTokenizer.from_pretrained("ronantakizawa/codereview-qwen32b")
```
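Whichever loading path you use, the prompt should follow the format shown in the Unsloth example. A small helper (hypothetical, mirroring that example) can assemble the user message:

```python
def build_review_prompt(pr_title, repo, file_path, language, code, diff, line=None):
    """Assemble a user message in the training prompt format."""
    location = f" (around line {line})" if line is not None else ""
    return (
        f"**Pull Request:** {pr_title}\n"
        f"**Repository:** {repo}\n"
        f"**File:** {file_path}\n"
        f"**Language:** {language}\n\n"
        f"**Code under review{location}:**\n"
        f"```{language}\n{code}\n```\n\n"
        f"**Diff context:**\n```diff\n{diff}\n```\n\n"
        "Review this code change and provide your comment."
    )
```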
## Output Format

The model prefixes each review comment with a comment type tag:

```
[suggestion] Consider using parameterized queries instead of string formatting
to prevent SQL injection vulnerabilities. Replace the f-string with
db.query("SELECT * FROM users WHERE name=?", (username,))
```

Comment types: `suggestion`, `issue`, `refactoring`, `style`, `performance`, `security`, `documentation`, `no_issues`
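The type tag is easy to strip programmatically (a minimal parsing sketch):

```python
import re

COMMENT_TYPES = {
    "suggestion", "issue", "refactoring", "style",
    "performance", "security", "documentation", "no_issues",
}

def parse_review(text):
    """Split a generated review into (comment_type, body).

    Falls back to (None, text) if no known type tag is present.
    """
    match = re.match(r"^\[([a-z_]+)\]\s*", text)
    if match and match.group(1) in COMMENT_TYPES:
        return match.group(1), text[match.end():]
    return None, text
```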
## Limitations
- Context window: Trained with 6,144 token sequences. Very large files are truncated around the review line.
- Single-comment focus: Primarily trained on individual inline comments, not full PR-level reviews (though 15% of training was structured multi-issue reviews).
- Language coverage: Best on Python, JavaScript, TypeScript, Java, Go, C++, Rust (the most represented languages in the training data). May underperform on less common languages.
- DPO regression: The DPO step did not improve over SFT with quality-score-based preference pairs. Model-generated pairs with an LLM judge may yield better results.
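To stay within the 6,144-token window, very large files need to be cut down to a region around the review line before prompting. A simple line-window sketch (the pipeline's exact truncation logic is not published, so treat this as an approximation):

```python
def window_around_line(source, review_line, context_lines=60):
    """Keep only `context_lines` lines on each side of the review line.

    `review_line` is 1-indexed, matching typical diff/PR line numbers.
    """
    lines = source.splitlines()
    start = max(0, review_line - 1 - context_lines)
    end = min(len(lines), review_line + context_lines)
    return "\n".join(lines[start:end])
```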
## Citation

```bibtex
@misc{codereview-qwen32b,
  title={CodeReview-Qwen32B: Fine-tuned Qwen2.5-Coder-32B for Automated Code Review},
  author={Ronan Takizawa},
  year={2026},
  url={https://huggingface.co/ronantakizawa/codereview-qwen32b}
}
```