YOYO-AI's fifth-generation merged model and the brand-new merging algorithm "yoyo_fusion" have been officially released!

Model Highlights:

Merge method: yoyo_fusion
Precision (dtype): bfloat16
Context length: 262,144 & 1,010,000 tokens

Parameter Settings:

Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
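As a convenience, the settings above can be written as a keyword dictionary. This is only a sketch: the key names follow the sampling arguments of Hugging Face `transformers` generation (`temperature`, `top_p`, `top_k`, `min_p`, `do_sample`), which is an assumption about the runtime in use; other inference engines may spell them differently.

```python
# Recommended sampling settings from this card, expressed as keyword arguments.
# Assumption: names follow Hugging Face transformers generation parameters.
sampling_kwargs = {
    "do_sample": True,   # sampling must be enabled for the settings below to apply
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}

for name, value in sampling_kwargs.items():
    print(f"{name} = {value}")
```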

GitHub Repository:

YOYO-Fusion

Configuration:

The following configuration was used to produce this model:

from yoyo_fusion import run_merge

run_merge(
    model_paths=[
        "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "Qwen/Qwen3-30B-A3B-Thinking-2507",
        "Qwen/Qwen3-Coder-30B-A3B-Instruct"
    ],
    output_dir="YOYO-AI/Qwen3-30B-A3B-YOYO-V5",
    anchor_index=0,
    config_dir=1,
    use_k_minus_one_truncation=True,
    use_geometric_median=True,
)

YOYO-Fusion: Robust Merging in Residual Subspace

Input

Given $K \ge 2$ weight tensors, one from each model, all with identical architecture:
$\{T^{(1)}, T^{(2)}, \dots, T^{(K)}\}, \quad T^{(k)} \in \mathbb{R}^{d_1 \times \cdots \times d_n}$

Step 1: Flatten and RMS-normalize each tensor

Flatten each tensor into a vector and normalize by its RMS:
$x^{(k)} = \operatorname{flatten}(T^{(k)}) \in \mathbb{R}^D, \quad D = \prod_{i=1}^n d_i$
$r_k = \operatorname{RMS}(x^{(k)}) = \sqrt{ \frac{1}{D} \sum_{i=1}^D (x^{(k)}_i)^2 + \varepsilon }$
$u^{(k)} = \frac{x^{(k)}}{r_k + \varepsilon}$
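Step 1 can be sketched in NumPy as follows. This is an illustration of the formulas above, not the repository's implementation: the function name `rms_normalize` is hypothetical, and the value `1e-8` for $\varepsilon$ is an assumption, since the text does not state it.

```python
import numpy as np

EPS = 1e-8  # assumed value; the card does not specify epsilon


def rms_normalize(tensors, eps=EPS):
    """Flatten each tensor and divide it by its RMS (Step 1)."""
    # x^(k) = flatten(T^(k))
    xs = [np.asarray(t, dtype=np.float64).reshape(-1) for t in tensors]
    # r_k = sqrt(mean(x^2) + eps)
    rms = np.array([np.sqrt(np.mean(x * x) + eps) for x in xs])
    # u^(k) = x^(k) / (r_k + eps), stacked as the rows of a K x D matrix
    U = np.stack([x / (r + eps) for x, r in zip(xs, rms)])
    return U, xs, rms


# Two toy "weight tensors" of the same shape
tensors = [np.full((2, 3), 2.0), np.full((2, 3), -4.0)]
U, xs, rms = rms_normalize(tensors)
print(U.shape, rms)
```

After normalization every row has RMS close to 1, so models whose weights differ only in overall scale map to nearly the same vector.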

Step 2: Determine Center Point

Case A: Anchor Mode

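Use the normalized vector of the anchor model as the center, where $a$ is the index of the anchor model (from here on, $\mathbf{u}_k$ is shorthand for $u^{(k)}$ from Step 1):
$\mathbf{m} = \mathbf{u}_a$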

Case B: No Anchor Mode

Subcase B1:

Compute the geometric median via the Weiszfeld algorithm:

$\mathbf{m} = \arg\min_{\mathbf{y}} \sum_{i=1}^K \| \mathbf{u}_i - \mathbf{y} \|_2$

Subcase B2:

Use coordinate-wise median:

$m_j = \text{median}(u_{1,j}, u_{2,j}, \dots, u_{K,j}), \quad \forall j=1,\dots,D$
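The three ways of choosing the center can be sketched as below. The function names are hypothetical, and two things are assumptions rather than statements from the text: that `anchor_index=None` selects the no-anchor mode, and that a `use_geometric_median` flag switches between Subcase B1 and Subcase B2. The Weiszfeld loop is the textbook iteration with a small guard against division by zero.

```python
import numpy as np


def geometric_median(U, iters=100, tol=1e-10, eps=1e-8):
    """Weiszfeld iterations for argmin_y sum_i ||u_i - y||_2."""
    y = U.mean(axis=0)  # start from the arithmetic mean
    for _ in range(iters):
        dist = np.linalg.norm(U - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)  # inverse-distance weights
        y_new = (w[:, None] * U).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y


def center_point(U, anchor_index=None, use_geometric_median=True):
    """Pick the center m from the K x D matrix of normalized vectors."""
    if anchor_index is not None:   # Case A: anchor mode
        return U[anchor_index].copy()
    if use_geometric_median:       # Subcase B1
        return geometric_median(U)
    return np.median(U, axis=0)    # Subcase B2: coordinate-wise median


U_demo = np.array([[0.0, 0.0], [1.0, 5.0], [10.0, 1.0]])
m_anchor = center_point(U_demo, anchor_index=0)
m_coord = center_point(U_demo, use_geometric_median=False)
m_geo = geometric_median(np.array([[0.0], [1.0], [10.0]]))
```

For collinear points the geometric median coincides with the ordinary median, which makes the last line a convenient sanity check.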

Step 3: Compute residual matrix

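Stack the normalized vectors as the rows of $\mathbf{U} \in \mathbb{R}^{K \times D}$ and subtract the center from every row:
$\mathbf{R} = \mathbf{U} - \mathbf{1}_K \mathbf{m}^\top \in \mathbb{R}^{K \times D}$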

Step 4: Early exit if residuals are negligible

If
$\max_k \|R_{k,:}\|_2 < 10^{-7},$
then set
$\mathbf{y}' = \mathbf{m}$
and skip to Step 8. Otherwise, proceed.
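Steps 3 and 4 amount to a broadcasted subtraction and a norm check. The sketch below assumes the normalized vectors are already stacked as rows of a K x D array; the function name `residuals` is hypothetical.

```python
import numpy as np


def residuals(U, m, tol=1e-7):
    """Return R = U - 1_K m^T and whether every row norm is below tol."""
    R = U - m[None, :]  # broadcasting subtracts m from each row
    negligible = bool(np.linalg.norm(R, axis=1).max() < tol)
    return R, negligible


# Identical inputs: all residuals vanish, so the merge can exit early
U_same = np.ones((3, 4))
R_same, skip_same = residuals(U_same, U_same[0])

# Distinct inputs: residuals are non-trivial and Steps 5 to 7 are needed
U_diff = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R_diff, skip_diff = residuals(U_diff, U_diff[0])
```

When `negligible` is true, the center itself is used as the estimate in normalized space and only the rescaling steps remain.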

Step 5: Perform SVD on residuals

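Compute the thin SVD of $R^\top \in \mathbb{R}^{D \times K}$ (here $U$ denotes the matrix of left singular vectors):
$R^\top = U \Sigma V^\top$
Let $r' = \min(K-1, \operatorname{rank}(R))$, and take the first $r'$ columns of $U$:
$U_{r'} = U[:, :r'] \in \mathbb{R}^{D \times r'}$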
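A minimal NumPy sketch of this step is shown below. The function name `truncated_basis` is hypothetical, and the numerical rank threshold (the same rule `np.linalg.matrix_rank` uses by default) is an assumption, since the text does not say how the rank is determined.

```python
import numpy as np


def truncated_basis(R):
    """Thin SVD of R^T and the first r' = min(K - 1, rank) left singular vectors."""
    K = R.shape[0]
    # R^T = U_svd @ diag(S) @ Vt, with U_svd of shape D x K
    U_svd, S, Vt = np.linalg.svd(R.T, full_matrices=False)
    # Assumed rank rule: count singular values above a relative tolerance
    tol = S.max() * max(R.shape) * np.finfo(S.dtype).eps
    rank = int(np.sum(S > tol))
    r_trunc = min(K - 1, rank)
    return U_svd[:, :r_trunc], S, rank


# K = 3 residual rows in D = 4 dimensions; the first row is zero,
# as it would be for the anchor model in anchor mode
R = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 0.0]])
U_r, S, rank = truncated_basis(R)
```

Because the SVD is taken of the tall matrix $R^\top$, its cost is driven by $K$, which is small, rather than by the number of parameters $D$.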

Step 6: Compute energy-based scaling factor

Total energy:
$E_{\text{total}} = \sum_{i=1}^{\operatorname{rank}(R)} \sigma_i^2$
Retained energy:
$E_{\text{retained}} = \sum_{i=1}^{r'} \sigma_i^2$
Energy ratio:
$p = \frac{E_{\text{retained}}}{E_{\text{total}} + \varepsilon}$
Scaling factor (clamped for stability):
$\lambda = \min\left( \frac{1}{p + \varepsilon},\ 10.0 \right)$
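The scaling factor can be computed directly from the singular values. In the sketch below the function name `energy_scale` and the value of `eps` are assumptions. Note that whenever $r' = \operatorname{rank}(R)$, for example in anchor mode where one residual row is zero, all energy is retained and $\lambda$ is approximately 1.

```python
import numpy as np


def energy_scale(S, rank, r_trunc, eps=1e-8, lam_max=10.0):
    """lambda = min(1 / (p + eps), lam_max), with p the retained energy ratio."""
    S = np.asarray(S, dtype=np.float64)
    e_total = float(np.sum(S[:rank] ** 2))
    e_retained = float(np.sum(S[:r_trunc] ** 2))
    p = e_retained / (e_total + eps)
    return min(1.0 / (p + eps), lam_max)


lam_partial = energy_scale([4.0, 3.0], rank=2, r_trunc=1)   # p = 16 / 25
lam_full = energy_scale([4.0, 3.0], rank=2, r_trunc=2)      # p close to 1
lam_clamped = energy_scale([1.0, 100.0], rank=2, r_trunc=1)  # tiny p, hits the cap
```

The second toy input is deliberately unsorted to force a very small ratio; real singular values arrive in descending order, so the cap of 10 only matters when the discarded directions carry most of the energy.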

Step 7: Robust weighted averaging in subspace

Project residuals into subspace

$Z = R U_{r'} \in \mathbb{R}^{K \times r'}$

Estimate robust scales

Per-coordinate MAD scale:
$s_j = 1.4826 \cdot \operatorname{median}_{k} \left( |Z_{k,j}| \right), \quad j = 1, \dots, r'$
Per-model residual norm:
$\|z_k\| = \|Z_{k,:}\|_2$
Global MAD scale:
$s_{\text{global}} = 1.4826 \cdot \operatorname{median}_{k} \left( \|z_k\| \right)$

Compute Tukey bisquare weights (`c = 4.685`)

Coordinate-wise weights:
$w^{\text{coord}}_{k,j} = \left[ \max\left( 0,\ 1 - \left( \frac{|Z_{k,j}|}{c \cdot s_j + \varepsilon} \right)^2 \right) \right]^2$
Global (per-model) weights:
$w^{\text{global}}_k = \left[ \max\left( 0,\ 1 - \left( \frac{\|z_k\|}{c \cdot s_{\text{global}} + \varepsilon} \right)^2 \right) \right]^2$
Combined weights:
$W_{k,j} = w^{\text{coord}}_{k,j} \cdot w^{\text{global}}_k$
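The projection, the MAD scales, and the combined Tukey weights can be sketched together. The function name `tukey_weights` and the `eps` value are assumptions; the constants 1.4826 and 4.685 come from the formulas above.

```python
import numpy as np


def tukey_weights(R, U_r, c=4.685, eps=1e-8):
    """Project residuals and compute combined Tukey bisquare weights W (K x r')."""
    Z = R @ U_r                                   # subspace coordinates
    s = 1.4826 * np.median(np.abs(Z), axis=0)     # per-coordinate MAD scale
    z_norm = np.linalg.norm(Z, axis=1)            # per-model residual norm
    s_global = 1.4826 * np.median(z_norm)         # global MAD scale
    w_coord = np.maximum(0.0, 1.0 - (np.abs(Z) / (c * s + eps)) ** 2) ** 2
    w_global = np.maximum(0.0, 1.0 - (z_norm / (c * s_global + eps)) ** 2) ** 2
    W = w_coord * w_global[:, None]
    return Z, W


# Four models, one-dimensional subspace; the last model is a gross outlier
R_demo = np.array([[0.1], [-0.1], [0.2], [100.0]])
U_r_demo = np.array([[1.0]])
Z_demo, W_demo = tukey_weights(R_demo, U_r_demo)
```

A model whose projected residual lies beyond `c` times the robust scale receives weight exactly zero, so it does not influence the consensus in that coordinate.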

Compute robust consensus in subspace

$z^*_j = \frac{ \sum_{k=1}^K W_{k,j} Z_{k,j} }{ \sum_{k=1}^K W_{k,j} + \varepsilon }, \quad j = 1, \dots, r'$
Reconstruct robust residual:
$r^* = \lambda \cdot U_{r'} z^* \in \mathbb{R}^D$
Final estimate in normalized space:
$y^{'} = m + r^{*}$
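Given the projected residuals and their weights, the consensus and reconstruction are a weighted mean followed by a matrix-vector product. This sketch uses a hypothetical function name `robust_estimate` and an assumed `eps`.

```python
import numpy as np


def robust_estimate(m, Z, W, U_r, lam, eps=1e-8):
    """y' = m + lam * U_r z*, where z* is the weighted mean of Z per coordinate."""
    z_star = (W * Z).sum(axis=0) / (W.sum(axis=0) + eps)
    r_star = lam * (U_r @ z_star)
    return m + r_star


m_demo = np.array([0.0, 5.0])
Z_demo = np.array([[1.0], [3.0], [100.0]])
W_demo = np.array([[1.0], [1.0], [0.0]])   # the outlier has zero weight
U_r_demo = np.array([[1.0], [0.0]])        # subspace spanned by the first axis
y_prime = robust_estimate(m_demo, Z_demo, W_demo, U_r_demo, lam=1.0)
```

In this toy case the consensus coordinate is the mean of 1 and 3, so the estimate moves the center by about 2 along the first axis and ignores the outlier at 100.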

Step 8: Restore average RMS scale

Compute mean RMS across inputs:
$\bar{r} = \frac{1}{K} \sum_{k=1}^K r_k$
Scale back:
$y = y' \cdot \bar{r}$

Step 9: Final L2 norm alignment

Compute average L2 norm of original flattened tensors:
$\bar{n} = \frac{1}{K} \sum_{k=1}^K \|x^{(k)}\|_2$
Compute current norm:
$n_y = \|y\|_2$
Final scaling factor:
$\alpha = \frac{\bar{n}}{n_y + \varepsilon}$
Scaled output vector:
$\hat{x} = \alpha \cdot y$
Reshape to original tensor shape:
$\hat{T} = \operatorname{reshape}(\hat{x},\ (d_1, \dots, d_n))$
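Steps 8 and 9 can be sketched as one helper. The name `restore_scale` and the `eps` value are assumptions. Because Step 9 rescales to the average L2 norm, the Step 8 factor cancels in the final result up to the $\varepsilon$ term; the sketch keeps both steps to mirror the description.

```python
import numpy as np


def restore_scale(y_prime, xs, rms, shape, eps=1e-8):
    """Rescale the normalized estimate and reshape it to the tensor shape."""
    y = y_prime * np.mean(rms)                          # Step 8: mean RMS
    n_bar = np.mean([np.linalg.norm(x) for x in xs])    # average input L2 norm
    alpha = n_bar / (np.linalg.norm(y) + eps)           # Step 9 scaling factor
    return (alpha * y).reshape(shape)


xs_demo = [np.array([3.0, 4.0, 0.0, 0.0]), np.array([0.0, 0.0, 6.0, 8.0])]
rms_demo = np.array([np.sqrt(np.mean(x * x)) for x in xs_demo])
T_hat = restore_scale(np.ones(4), xs_demo, rms_demo, shape=(2, 2))
```

The merged tensor ends up with an L2 norm equal to the average of the input norms (5 and 10 in this toy case).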

Model size: 31B params (Safetensors)
Tensor type: BF16
