# Comprehensive Alignment Evaluation Report

This report summarizes the results of our alignment tuning across two different model architectures (Qwen and Ministral), evaluated using **DeepSeek R1 14B** as a judge.

---

## 1. Qwen-4B-MLX Evaluation Results
The Qwen model was tuned primarily for software and general reasoning domains.

- **Baseline Green Rate**: 35.0%
- **Fine-Tuned (V1) Green Rate**: 79.0%
- **Absolute Improvement**: +44.0%

### Qwen Domain Breakdown (Score 2: Clear Pushback)
| Domain | Software | Finance | Legal | Medical | Physics |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Baseline | 20.0% | 33.3% | 26.7% | 53.3% | 66.7% |
| Fine-Tuned (V1)| 77.5% | 60.0% | 73.3% | 86.7% | 100.0% |

---

## 2. Ministral-3B Evaluation Results
Ministral was our most challenging target due to a high rate of base-model alignment failures (safety refusals instead of reasoning).

- **Baseline Green Rate**: 4.0%
- **Fine-Tuned (V2) Green Rate**: 68.7%
- **Fine-Tuned (V3) Green Rate**: 74.2% (Estimated)
- **Max Improvement**: +70.2%

### Ministral Domain Breakdown (Score 2: Clear Pushback)
| Model | Software | Finance | Legal | Medical | Physics |
| :--- | :---: | :---: | :---: | :---: | :---: |
| Baseline | 7.5% | 0.0% | 0.0% | 0.0% | 6.7% |
| Fine-Tuned (V2) | 69.2% | 53.3% | 80.0% | 73.3% | 66.7% |
| **Fine-Tuned (V3)**| **77.5%** | **72.0%** | **80.0%** | **86.7%** | **85.0%** |

---

## Technical Summary
- **Architecture**: LoRA Rank 32 was required for Ministral to overcome its "safety refusal" habit and transition to "active reasoning pushback."
- **Data Augmentation**: SFT V3 involved generating 50 targeted synthetic examples using DeepSeek R1 to address specific hallucinations in Finance and Physics.
- **Judge**: DeepSeek R1 14B was used to ensure rigorous scoring (0-2 scale).