# CodeReview-Qwen32B
A fine-tuned Qwen2.5-Coder-32B-Instruct model specialized for code review. Trained on 48,460 human-written code review comments from the top GitHub repositories.
This is a 4-bit QLoRA adapter trained with supervised fine-tuning (SFT); a follow-up DPO step was run but did not improve on the SFT checkpoint (see Limitations).
## Results
| Metric | Base Model (Zero-Shot) | SFT Fine-Tuned | Change |
|---|---|---|---|
| BLEU-4 | 3.82 | 16.91 | +343% |
| ROUGE-L F1 | 0.081 | 0.216 | +167% |
| Sentence-BERT Cosine Sim | 0.477 | 0.526 | +10% |
| Comment Type Accuracy | 0.00 | 0.640 | -- |
| Comment Type Coverage | 0.00 | 1.00 | -- |
The fine-tuned model produces review comments with roughly 4x the lexical overlap with real human reviews (BLEU-4), nearly 3x longer matching subsequences (ROUGE-L), and it tags its comments with the correct type (suggestion, issue, refactoring, etc.) 64% of the time.
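For reference, sentence-level BLEU-4 can be sketched in plain Python (modified n-gram precision with a brevity penalty; a minimal illustration, not the exact evaluation script used for the numbers above):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, hypothesis):
    """Sentence BLEU-4: geometric mean of clipped 1..4-gram precisions
    times a brevity penalty. Returns 0.0 if any precision is zero."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, 5):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ng[g]) for g, c in hyp_ng.items())
        total = max(sum(hyp_ng.values()), 1)
        if overlap == 0:
            return 0.0
        log_prec += 0.25 * math.log(overlap / total)
    # Penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)
```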
## Model Details
- Base model: Qwen/Qwen2.5-Coder-32B-Instruct
- Method: QLoRA (4-bit) SFT → DPO
- Training data: ronantakizawa/github-codereview (119K examples, 48K after filtering)
- Hardware: 1x NVIDIA H100 SXM 80GB
- Training time: ~6 hours SFT + ~1.7 hours DPO
- Framework: Unsloth + TRL 0.24.0 + Transformers 5.2.0
## Training

### Dataset
The training data comes from ronantakizawa/github-codereview, containing 119K real inline code review triplets (code, review comment, revised code) extracted from merged pull requests across 504 top GitHub repositories.
Filtering: Quality score >= 0.5, minimum 50 char comments, deduplicated by (repo, PR, line), capped at 1000 examples per repo. 25 repos held out entirely for OOD evaluation.
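The filtering rules above can be sketched roughly as follows (field names such as `quality_score`, `repo`, `pr`, and `line` are illustrative, not necessarily the dataset's exact schema):

```python
from collections import defaultdict

def filter_examples(examples, min_quality=0.5, min_comment_chars=50, per_repo_cap=1000):
    """Apply the quality, length, dedup, and per-repo cap filters."""
    seen = set()                  # dedup key: (repo, PR, line)
    per_repo = defaultdict(int)   # examples kept per repository
    kept = []
    for ex in examples:
        if ex["quality_score"] < min_quality:
            continue
        if len(ex["comment"]) < min_comment_chars:
            continue
        key = (ex["repo"], ex["pr"], ex["line"])
        if key in seen:
            continue
        if per_repo[ex["repo"]] >= per_repo_cap:
            continue
        seen.add(key)
        per_repo[ex["repo"]] += 1
        kept.append(ex)
    return kept
```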
### Multi-Task SFT
The model was trained on four tasks simultaneously:
| Task | Mix | Description |
|---|---|---|
| Comment Generation | 50% | Code + diff → [comment_type] review comment |
| Code Refinement | 20% | Code + comment → revised code |
| Structured Review | 15% | Multiple diffs → numbered multi-issue review |
| Negative Examples | 15% | Clean code → "no issues found" (reduces false positives) |
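The task mix can be realized with a simple weighted sampler (an illustrative sketch assuming per-task example pools; the actual training pipeline may interleave tasks differently):

```python
import random

# SFT task mixing weights from the table above.
TASK_MIX = {
    "comment_generation": 0.50,
    "code_refinement": 0.20,
    "structured_review": 0.15,
    "negative_examples": 0.15,
}

def sample_task(rng):
    """Draw one task name according to the mixing weights."""
    tasks, weights = zip(*TASK_MIX.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

# Sanity check: empirical frequencies approach the target mix.
rng = random.Random(0)
counts = {t: 0 for t in TASK_MIX}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
```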
### SFT Hyperparameters
| Parameter | Value |
|---|---|
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0 |
| Target modules | q/k/v/o_proj, gate/up/down_proj |
| Learning rate | 2e-4 (cosine, 5% warmup) |
| Effective batch size | 16 (per_device=2, grad_accum=8) |
| Epochs | 2 |
| Max sequence length | 6,144 tokens |
| Precision | bf16 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth mode |
| Total steps | 3,029 |
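In `peft` terms, the adapter settings above correspond roughly to the following configuration (a sketch; the actual run used Unsloth's `get_peft_model` wrapper, and training arguments are omitted):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,              # LoRA rank
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```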
## Usage
### With Unsloth (Recommended)

````python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ronantakizawa/codereview-qwen32b",
    max_seq_length=6144,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": (
        "You are an expert code reviewer for Python projects. "
        "Analyze the code and diff context, then provide a precise, "
        "actionable review comment identifying issues or suggesting improvements. "
        "If the code is well-written and has no issues, state that clearly."
    )},
    {"role": "user", "content": """**Pull Request:** Fix user authentication
**Repository:** example/app
**File:** auth/login.py
**Language:** Python

**Code under review (around line 42):**
```python
def authenticate(username, password):
    user = db.query(f"SELECT * FROM users WHERE name='{username}'")
    if user and user.password == password:
        return create_session(user)
    return None
```

**Diff context:**
```diff
+def authenticate(username, password):
+    user = db.query(f"SELECT * FROM users WHERE name='{username}'")
+    if user and user.password == password:
+        return create_session(user)
+    return None
```

Review this code change and provide your comment."""},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512, temperature=0.1)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
````
### With Transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base_model, "ronantakizawa/codereview-qwen32b")
tokenizer = AutoTokenizer.from_pretrained("ronantakizawa/codereview-qwen32b")
```
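Whichever loading path you use, the prompt should follow the format shown in the Unsloth example. A small helper (hypothetical, mirroring that example) can assemble the user message:

```python
def build_review_prompt(pr_title, repo, file_path, language, code, diff, line=None):
    """Assemble a user message in the training prompt format."""
    location = f" (around line {line})" if line is not None else ""
    return (
        f"**Pull Request:** {pr_title}\n"
        f"**Repository:** {repo}\n"
        f"**File:** {file_path}\n"
        f"**Language:** {language}\n\n"
        f"**Code under review{location}:**\n"
        f"```{language}\n{code}\n```\n\n"
        f"**Diff context:**\n```diff\n{diff}\n```\n\n"
        "Review this code change and provide your comment."
    )
```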
## Output Format

The model prefixes each review comment with a comment type tag:

```
[suggestion] Consider using parameterized queries instead of string formatting
to prevent SQL injection vulnerabilities. Replace the f-string with
db.query("SELECT * FROM users WHERE name=?", (username,))
```

Comment types: `suggestion`, `issue`, `refactoring`, `style`, `performance`, `security`, `documentation`, `no_issues`
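The type tag is easy to strip programmatically (a minimal parsing sketch):

```python
import re

COMMENT_TYPES = {
    "suggestion", "issue", "refactoring", "style",
    "performance", "security", "documentation", "no_issues",
}

def parse_review(text):
    """Split a generated review into (comment_type, body).

    Falls back to (None, text) if no known type tag is present.
    """
    match = re.match(r"^\[([a-z_]+)\]\s*", text)
    if match and match.group(1) in COMMENT_TYPES:
        return match.group(1), text[match.end():]
    return None, text
```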
## Limitations
- Context window: Trained with 6,144 token sequences. Very large files are truncated around the review line.
- Single-comment focus: Primarily trained on individual inline comments, not full PR-level reviews (though 15% of training was structured multi-issue reviews).
- Language coverage: Best on Python, JavaScript, TypeScript, Java, Go, C++, Rust (the most represented languages in the training data). May underperform on less common languages.
- DPO regression: The DPO step did not improve over SFT with quality-score-based preference pairs. Model-generated pairs with an LLM judge may yield better results.
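To stay within the 6,144-token window, very large files need to be cut down to a region around the review line before prompting. A simple line-window sketch (the pipeline's exact truncation logic is not published, so treat this as an approximation):

```python
def window_around_line(source, review_line, context_lines=60):
    """Keep only `context_lines` lines on each side of the review line.

    `review_line` is 1-indexed, matching typical diff/PR line numbers.
    """
    lines = source.splitlines()
    start = max(0, review_line - 1 - context_lines)
    end = min(len(lines), review_line + context_lines)
    return "\n".join(lines[start:end])
```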
## Citation

```bibtex
@misc{codereview-qwen32b,
  title={CodeReview-Qwen32B: Fine-tuned Qwen2.5-Coder-32B for Automated Code Review},
  author={Ronan Takizawa},
  year={2026},
  url={https://huggingface.co/ronantakizawa/codereview-qwen32b}
}
```