# CodeReview-Qwen32B

A fine-tuned Qwen2.5-Coder-32B-Instruct model specialized for code review, trained on 48,460 human-written code review comments from top GitHub repositories.

This is a QLoRA adapter (4-bit) trained with SFT.

## Results

| Metric | Base Model (Zero-Shot) | SFT Fine-Tuned | Change |
|---|---|---|---|
| BLEU-4 | 3.82 | 16.91 | +343% |
| ROUGE-L F1 | 0.081 | 0.216 | +167% |
| Sentence-BERT Cosine Sim | 0.477 | 0.526 | +10% |
| Comment Type Accuracy | 0.00 | 0.640 | -- |
| Comment Type Coverage | 0.00 | 1.00 | -- |

The fine-tuned model produces review comments with more than 4x the lexical overlap with real human reviews (BLEU-4), nearly 3x longer matching subsequences (ROUGE-L F1), and correctly classifies its own comment type (suggestion, issue, refactoring, etc.) 64% of the time.
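The "Change" column in the table is plain percent-change arithmetic over the metric values; a quick sketch using the numbers from the table above:

```python
def pct_change(base: float, tuned: float) -> float:
    """Relative improvement of the fine-tuned score over the base score, in percent."""
    return (tuned - base) / base * 100

# BLEU-4: 3.82 -> 16.91
print(round(pct_change(3.82, 16.91)))    # ~ +343
# ROUGE-L F1: 0.081 -> 0.216
print(round(pct_change(0.081, 0.216)))   # ~ +167
# Sentence-BERT cosine similarity: 0.477 -> 0.526
print(round(pct_change(0.477, 0.526)))   # ~ +10
```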

## Training

### Dataset

The training data comes from `ronantakizawa/github-codereview`, containing 119K real inline code review triplets (code, review comment, revised code) extracted from merged pull requests across 504 top GitHub repositories.

**Filtering:** quality score >= 0.5, comments of at least 50 characters, deduplicated by (repo, PR, line), capped at 1,000 examples per repo. 25 repos were held out entirely for out-of-distribution (OOD) evaluation.
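The filtering rules above can be sketched as a single pass over the raw records. This is an illustration, not the original preprocessing script; the field names (`quality_score`, `comment`, `repo`, `pr`, `line`) are assumptions, not the dataset's actual schema:

```python
from collections import defaultdict

def filter_examples(records, min_quality=0.5, min_comment_len=50, per_repo_cap=1000):
    """Apply the quality, length, dedup, and per-repo-cap filters in one pass."""
    seen = set()                  # dedup key: (repo, PR, line)
    per_repo = defaultdict(int)   # examples kept so far per repository
    kept = []
    for r in records:
        if r["quality_score"] < min_quality:
            continue
        if len(r["comment"]) < min_comment_len:
            continue
        key = (r["repo"], r["pr"], r["line"])
        if key in seen:
            continue
        if per_repo[r["repo"]] >= per_repo_cap:
            continue
        seen.add(key)
        per_repo[r["repo"]] += 1
        kept.append(r)
    return kept
```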

### Multi-Task SFT

The model was trained on four tasks simultaneously:

| Task | Mix | Description |
|---|---|---|
| Comment Generation | 50% | Code + diff → `[comment_type]` review comment |
| Code Refinement | 20% | Code + comment → revised code |
| Structured Review | 15% | Multiple diffs → numbered multi-issue review |
| Negative Examples | 15% | Clean code → "no issues found" (reduces false positives) |
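One way to realize a 50/20/15/15 mix is weighted sampling over the four task builders. The weights come from the table above, but how the original pipeline interleaved tasks is not specified, so the sampling mechanism here is an illustrative assumption:

```python
import random

# Task mix from the table above.
TASK_WEIGHTS = {
    "comment_generation": 0.50,
    "code_refinement": 0.20,
    "structured_review": 0.15,
    "negative_examples": 0.15,
}

def sample_task(rng: random.Random) -> str:
    """Pick the task for the next training example according to the mix."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {t: 0 for t in TASK_WEIGHTS}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
# Empirical proportions approach 50/20/15/15 as the number of draws grows.
```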

### SFT Hyperparameters

| Parameter | Value |
|---|---|
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0 |
| Target modules | q/k/v/o_proj, gate/up/down_proj |
| Learning rate | 2e-4 (cosine, 5% warmup) |
| Effective batch size | 16 (per_device=2, grad_accum=8) |
| Epochs | 2 |
| Max sequence length | 6,144 tokens |
| Precision | bf16 |
| Packing | Enabled |
| Gradient checkpointing | Unsloth mode |
| Total steps | 3,029 |
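In `peft`/`transformers` terms, the table above corresponds roughly to the following configuration. This is a sketch, not the actual training script (Unsloth wraps these same knobs with its own optimizations):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings from the table: rank 32, alpha 32, no dropout,
# all attention and MLP projection matrices as targets.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Optimizer/schedule settings from the table; effective batch
# size is 2 (per device) x 8 (grad accum) = 16.
training_args = TrainingArguments(
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=2,
    bf16=True,
    gradient_checkpointing=True,
)
```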

## Usage

### With Unsloth (Recommended)

````python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ronantakizawa/codereview-qwen32b",
    max_seq_length=6144,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)

messages = [
    {"role": "system", "content": (
        "You are an expert code reviewer for Python projects. "
        "Analyze the code and diff context, then provide a precise, "
        "actionable review comment identifying issues or suggesting improvements. "
        "If the code is well-written and has no issues, state that clearly."
    )},
    {"role": "user", "content": """**Pull Request:** Fix user authentication
**Repository:** example/app
**File:** auth/login.py
**Language:** Python

**Code under review (around line 42):**
```python
def authenticate(username, password):
    user = db.query(f"SELECT * FROM users WHERE name='{username}'")
    if user and user.password == password:
        return create_session(user)
    return None
```

**Diff context:**
```diff
+def authenticate(username, password):
+    user = db.query(f"SELECT * FROM users WHERE name='{username}'")
+    if user and user.password == password:
+        return create_session(user)
+    return None
```

Review this code change and provide your comment."""},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512, temperature=0.1)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
````

### With Transformers + PEFT

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(base_model, "ronantakizawa/codereview-qwen32b")
tokenizer = AutoTokenizer.from_pretrained("ronantakizawa/codereview-qwen32b")
```

## Output Format

The model prefixes its output with a comment type tag:

```
[suggestion] Consider using parameterized queries instead of string formatting
to prevent SQL injection vulnerabilities. Replace the f-string with
db.query("SELECT * FROM users WHERE name=?", (username,))
```

Comment types: `suggestion`, `issue`, `refactoring`, `style`, `performance`, `security`, `documentation`, `no_issues`
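Since every generated comment starts with a bracketed type tag, downstream tooling can split the tag from the comment body with a small parser. A minimal sketch (the tag set mirrors the comment types above; `parse_review` is an illustrative helper, not part of the model's API):

```python
import re

COMMENT_TYPES = {
    "suggestion", "issue", "refactoring", "style",
    "performance", "security", "documentation", "no_issues",
}

def parse_review(text: str) -> tuple[str, str]:
    """Split a generated review into (comment_type, body).

    Falls back to ("unknown", text) when no recognized tag is present.
    """
    m = re.match(r"\[(\w+)\]\s*(.*)", text, flags=re.DOTALL)
    if m and m.group(1) in COMMENT_TYPES:
        return m.group(1), m.group(2)
    return "unknown", text.strip()

tag, body = parse_review(
    "[suggestion] Consider using parameterized queries instead of string formatting."
)
# tag == "suggestion"
```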

## Limitations

- **Context window:** trained with 6,144-token sequences; very large files are truncated around the review line.
- **Single-comment focus:** primarily trained on individual inline comments, not full PR-level reviews (though 15% of training was structured multi-issue reviews).
- **Language coverage:** best on Python, JavaScript, TypeScript, Java, Go, C++, and Rust (the most represented languages in the training data); may underperform on less common languages.
- **DPO regression:** a follow-up DPO step using quality-score-based preference pairs did not improve over SFT. Model-generated pairs ranked by an LLM judge may yield better results.

## Citation

```bibtex
@misc{codereview-qwen32b,
  title={CodeReview-Qwen32B: Fine-tuned Qwen2.5-Coder-32B for Automated Code Review},
  author={Ronan Takizawa},
  year={2026},
  url={https://huggingface.co/ronantakizawa/codereview-qwen32b}
}
```