The 5th-Generation Merged Model of YOYO-AI and the brand-new merging algorithm "yoyo_fusion" have been officially released!

Model Highlights:

  • merge method: yoyo_fusion

  • precision (dtype): bfloat16

  • Context length: 262,144 and 1,010,000 tokens

Parameter Settings:

Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
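For convenience, the settings above can be written as a plain dictionary. This is only a sketch: the key names follow the keyword names commonly used by Hugging Face `transformers` generation (`temperature`, `top_p`, `top_k`, `min_p`), which is an assumption, since the model card does not name an inference framework.

```python
# Recommended sampling settings from the model card, as a plain dict.
# Key names are an assumption (they mirror common generation keyword names).
SAMPLING_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}

for name, value in SAMPLING_PARAMS.items():
    print(f"{name}={value}")
```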

GitHub Repository:

YOYO-Fusion

Configuration:

The following configuration was used to produce this model:

from yoyo_fusion import run_merge

run_merge(
    model_paths=[
        "Qwen/Qwen3-30B-A3B-Instruct-2507",
        "Qwen/Qwen3-30B-A3B-Thinking-2507",
        "Qwen/Qwen3-Coder-30B-A3B-Instruct"
    ],
    output_dir="YOYO-AI/Qwen3-30B-A3B-YOYO-V5",
    anchor_index=0,
    config_dir=1,
    use_k_minus_one_truncation=True,
    use_geometric_median=True,
)

YOYO-Fusion: Robust Merging in Residual Subspace

Input

Given $K \ge 2$ weight tensors from models with identical architecture:
$$ \{T^{(1)}, T^{(2)}, \dots, T^{(K)}\}, \quad T^{(k)} \in \mathbb{R}^{d_1 \times \cdots \times d_n}. $$


Step 1: Flatten and RMS-normalize each tensor

Flatten each tensor into a vector and normalize by its RMS:
$$ x^{(k)} = \operatorname{flatten}(T^{(k)}) \in \mathbb{R}^D, \quad D = \prod_{i=1}^n d_i $$
$$ r_k = \operatorname{RMS}(x^{(k)}) = \sqrt{ \frac{1}{D} \sum_{i=1}^D (x^{(k)}_i)^2 + \varepsilon } $$
$$ u^{(k)} = \frac{x^{(k)}}{r_k + \varepsilon} $$
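A minimal NumPy sketch of Step 1 follows. The value of `eps` is an assumption (the text does not state $\varepsilon$), and the function name `rms_normalize` is hypothetical rather than part of the `yoyo_fusion` package.

```python
import numpy as np

EPS = 1e-8  # assumed value; the text does not specify epsilon


def rms_normalize(tensor, eps=EPS):
    """Flatten a tensor and divide it by its RMS, as in Step 1."""
    x = np.asarray(tensor, dtype=np.float64).reshape(-1)  # x^(k) in R^D
    r = np.sqrt(np.mean(x ** 2) + eps)                    # r_k (eps inside the root)
    u = x / (r + eps)                                     # u^(k)
    return u, r


u, r = rms_normalize(np.array([[3.0, 4.0], [0.0, 0.0]]))
print(r)  # close to 2.5, since mean(x^2) = 25 / 4
print(u)  # close to [1.2, 1.6, 0.0, 0.0]
```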


Step 2: Determine Center Point

Case A: Anchor Mode

Use the normalized vector of the anchor model (index $n$) as the center:
$$ \mathbf{m} = \mathbf{u}_n $$

Case B: No Anchor Mode

  • Subcase B1:

    Compute the geometric median via the Weiszfeld algorithm:

$$ \mathbf{m} = \arg\min_{\mathbf{y}} \sum_{i=1}^K \| \mathbf{u}_i - \mathbf{y} \|_2 $$

  • Subcase B2:

    Use coordinate-wise median:

$$ m_j = \operatorname{median}(u_{1,j}, u_{2,j}, \dots, u_{K,j}), \quad \forall j = 1, \dots, D $$
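The three ways of choosing the center can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, `anchor_index=None` is used here to mean "no anchor mode", and the mapping of a `use_geometric_median` flag to Subcase B1 versus B2 is a guess based on the configuration shown earlier. The iteration count and tolerances are also assumed.

```python
import numpy as np


def geometric_median(U, n_iter=100, tol=1e-10, eps=1e-8):
    """Weiszfeld iteration for the geometric median of the rows of U (K x D)."""
    y = U.mean(axis=0)  # start from the arithmetic mean
    for _ in range(n_iter):
        dist = np.linalg.norm(U - y, axis=1)
        w = 1.0 / np.maximum(dist, eps)  # guard against division by zero
        y_new = (w[:, None] * U).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y


def center_point(U, anchor_index=None, use_geometric_median=True):
    """Pick the center m: anchor row, geometric median, or coordinate-wise median."""
    if anchor_index is not None:      # Case A
        return U[anchor_index].copy()
    if use_geometric_median:          # Subcase B1 (assumed mapping)
        return geometric_median(U)
    return np.median(U, axis=0)       # Subcase B2


U_demo = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
print(center_point(U_demo, anchor_index=0))
print(center_point(U_demo))
print(center_point(U_demo, use_geometric_median=False))
```

For collinear points the geometric median coincides with the one-dimensional median, which makes the small example easy to check by hand.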


Step 3: Compute residual matrix

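Stack the normalized vectors $u^{(k)}$ as the rows of $\mathbf{U} \in \mathbb{R}^{K \times D}$ and subtract the center from every row:
$$ \mathbf{R} = \mathbf{U} - \mathbf{1}_K \mathbf{m}^\top \in \mathbb{R}^{K \times D} $$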


Step 4: Early exit if residuals are negligible

If
$$ \max_k \|R_{k,:}\|_2 < 10^{-7}, $$
then set
$$ \mathbf{y}' = \mathbf{m} $$
and skip to Step 8. Otherwise, proceed.
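Steps 3 and 4 amount to a broadcasted subtraction followed by a row-norm check. The sketch below uses hypothetical function names; only the threshold $10^{-7}$ comes from the text.

```python
import numpy as np


def residual_matrix(U, m):
    """R = U - 1_K m^T, with U of shape (K, D) and m of shape (D,)."""
    return U - m[None, :]


def residuals_negligible(R, threshold=1e-7):
    """True when every row of R has an L2 norm below the threshold."""
    return bool(np.linalg.norm(R, axis=1).max() < threshold)


U_same = np.ones((3, 4))
m_demo = np.ones(4)
print(residuals_negligible(residual_matrix(U_same, m_demo)))  # identical inputs
```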


Step 5: Perform SVD on residuals

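Compute the thin SVD of $R^\top \in \mathbb{R}^{D \times K}$ (here $U$ denotes the left singular vectors, not the stacked matrix $\mathbf{U}$ of Step 3):
$$ R^\top = U \Sigma V^\top $$
Let $r' = \min(K-1, \operatorname{rank}(R))$, and take the first $r'$ columns of $U$:
$$ U_{r'} = U[:, :r'] \in \mathbb{R}^{D \times r'} $$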
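A possible NumPy rendering of Step 5 is shown below. The numeric rank is estimated by counting singular values above a tolerance; that tolerance and the function name `residual_basis` are assumptions, not details from the source.

```python
import numpy as np


def residual_basis(R, rank_tol=1e-10):
    """Thin SVD of R^T and truncation to r' = min(K - 1, rank(R)) columns."""
    K = R.shape[0]
    # full_matrices=False gives the thin SVD: U_full has shape (D, min(D, K))
    U_full, sigma, _ = np.linalg.svd(R.T, full_matrices=False)
    rank = int(np.sum(sigma > rank_tol))  # numeric rank (tolerance is assumed)
    r_prime = min(K - 1, rank)
    return U_full[:, :r_prime], sigma, r_prime


R_demo = np.array([
    [0.0, 0.0, 0.0, 0.0],  # anchor row: zero residual
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
])
U_r, sigma, r_prime = residual_basis(R_demo)
print(r_prime, U_r.shape)
```

With an anchor, one row of the residual matrix is zero, so the rank can never exceed $K-1$ and the truncation keeps every informative direction.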


Step 6: Compute energy-based scaling factor

Total energy:
$$ E_{\text{total}} = \sum_{i=1}^{\operatorname{rank}(R)} \sigma_i^2 $$
Retained energy:
$$ E_{\text{retained}} = \sum_{i=1}^{r'} \sigma_i^2 $$
Energy ratio:
$$ p = \frac{E_{\text{retained}}}{E_{\text{total}} + \varepsilon} $$
Scaling factor (clamped for stability):
$$ \lambda = \min\left( \frac{1}{p + \varepsilon},\ 10.0 \right) $$
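The scaling factor can be computed directly from the singular values. In this sketch the sum for the total energy runs over all returned singular values, which matches the sum up to the rank because the remaining values are numerically zero; `eps` is an assumed value and `energy_scale` is a hypothetical name.

```python
import numpy as np


def energy_scale(sigma, r_prime, eps=1e-8, max_scale=10.0):
    """lambda = min(1 / (p + eps), max_scale), with p the retained energy ratio."""
    sigma = np.asarray(sigma, dtype=np.float64)
    e_total = np.sum(sigma ** 2)
    e_retained = np.sum(sigma[:r_prime] ** 2)
    p = e_retained / (e_total + eps)
    return float(min(1.0 / (p + eps), max_scale))


print(energy_scale([2.0, 1.0, 1.0], r_prime=2))  # p = 5/6, so about 1.2
print(energy_scale([1.0, 1.0], r_prime=0))       # nothing retained: clamped to 10.0
```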


Step 7: Robust weighted averaging in subspace

Project residuals into subspace

$$ Z = R U_{r'} \in \mathbb{R}^{K \times r'} $$

Estimate robust scales

Per-coordinate MAD scale:
$$ s_j = 1.4826 \cdot \operatorname{median}_{k} \left( |Z_{k,j}| \right), \quad j = 1, \dots, r' $$
Per-model residual norm:
$$ \|z_k\| = \|Z_{k,:}\|_2 $$
Global MAD scale:
$$ s_{\text{global}} = 1.4826 \cdot \operatorname{median}_{k} \left( \|z_k\| \right) $$

Compute Tukey bisquare weights ($c = 4.685$)

Coordinate-wise weights:
$$ w^{\text{coord}}_{k,j} = \left[ \max\left( 0,\ 1 - \left( \frac{|Z_{k,j}|}{c \cdot s_j + \varepsilon} \right)^2 \right) \right]^2 $$
Global (per-model) weights:
$$ w^{\text{global}}_k = \left[ \max\left( 0,\ 1 - \left( \frac{\|z_k\|}{c \cdot s_{\text{global}} + \varepsilon} \right)^2 \right) \right]^2 $$
Combined weights:
$$ W_{k,j} = w^{\text{coord}}_{k,j} \cdot w^{\text{global}}_k $$
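The scale estimates and Tukey bisquare weights can be sketched in a few vectorized lines. The function name `tukey_weights` and the value of `eps` are assumptions; the constants 1.4826 and 4.685 are taken from the text.

```python
import numpy as np

C_TUKEY = 4.685  # tuning constant c from the text


def tukey_weights(Z, c=C_TUKEY, eps=1e-8):
    """Combined weights W (K x r') from coordinate-wise and per-model bisquare weights."""
    abs_z = np.abs(Z)
    s = 1.4826 * np.median(abs_z, axis=0)         # per-coordinate scale, shape (r',)
    norms = np.linalg.norm(Z, axis=1)             # per-model residual norm, shape (K,)
    s_global = 1.4826 * np.median(norms)          # global scale
    w_coord = np.maximum(0.0, 1.0 - (abs_z / (c * s + eps)) ** 2) ** 2
    w_global = np.maximum(0.0, 1.0 - (norms / (c * s_global + eps)) ** 2) ** 2
    return w_coord * w_global[:, None]


Z_demo = np.array([[0.0], [1.0], [1.0], [100.0]])  # last model is an outlier
print(tukey_weights(Z_demo))
```

In the example the fourth model lies far outside the cutoff $c \cdot s$, so its weight is exactly zero, while a zero residual receives full weight.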

Compute robust consensus in subspace

$$ z^*_j = \frac{ \sum_{k=1}^K W_{k,j} Z_{k,j} }{ \sum_{k=1}^K W_{k,j} + \varepsilon }, \quad j = 1, \dots, r' $$
Reconstruct robust residual:
$$ r^* = \lambda \cdot U_{r'} z^* \in \mathbb{R}^D $$
Final estimate in normalized space:
$$ y' = m + r^* $$
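Given the weights, the consensus and reconstruction reduce to a weighted average followed by a matrix-vector product. The sketch below uses a hypothetical function name and an assumed `eps`; the inputs are the quantities defined in Steps 2 to 7.

```python
import numpy as np


def robust_estimate(m, U_r, Z, W, lam, eps=1e-8):
    """y' = m + lambda * U_r z*, with z* the weighted mean of Z over models."""
    z_star = (W * Z).sum(axis=0) / (W.sum(axis=0) + eps)  # shape (r',)
    r_star = lam * (U_r @ z_star)                          # shape (D,)
    return m + r_star


m_demo = np.zeros(3)
U_r_demo = np.array([[1.0], [0.0], [0.0]])  # one basis direction in R^3
Z_demo = np.array([[2.0], [4.0]])           # two models, one coordinate
W_demo = np.ones_like(Z_demo)               # equal weights
print(robust_estimate(m_demo, U_r_demo, Z_demo, W_demo, lam=1.0))
```

With equal weights the consensus is the plain mean of the projected residuals, so the example moves the center by about 3 along the single basis direction.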


Step 8: Restore average RMS scale

Compute mean RMS across inputs:
$$ \bar{r} = \frac{1}{K} \sum_{k=1}^K r_k $$
Scale back:
$$ y = y' \cdot \bar{r} $$


Step 9: Final L2 norm alignment

Compute average L2 norm of original flattened tensors:
$$ \bar{n} = \frac{1}{K} \sum_{k=1}^K \|x^{(k)}\|_2 $$
Compute current norm:
$$ n_y = \|y\|_2 $$
Final scaling factor:
$$ \alpha = \frac{\bar{n}}{n_y + \varepsilon} $$
Scaled output vector:
$$ \hat{x} = \alpha \cdot y $$
Reshape to original tensor shape:
$$ \hat{T} = \operatorname{reshape}(\hat{x},\ (d_1, \dots, d_n)) $$
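Steps 8 and 9 can be combined into one small helper, sketched here with a hypothetical name and an assumed `eps`. It takes the normalized estimate, the original flattened vectors, and their RMS values from Step 1.

```python
import numpy as np


def restore_scale(y_prime, xs, rms_values, shape, eps=1e-8):
    """Rescale y' by the mean RMS, align its L2 norm, and reshape."""
    y = y_prime * np.mean(rms_values)                  # Step 8
    n_bar = np.mean([np.linalg.norm(x) for x in xs])   # average original L2 norm
    alpha = n_bar / (np.linalg.norm(y) + eps)          # Step 9 scaling factor
    return (alpha * y).reshape(shape)


xs_demo = [np.array([3.0, 4.0, 0.0, 0.0]), np.array([0.0, 0.0, 6.0, 8.0])]
rms_demo = [2.5, 5.0]  # RMS of each vector above
T_hat = restore_scale(np.array([1.0, 0.0, 0.0, 0.0]), xs_demo, rms_demo, (2, 2))
print(T_hat)
print(np.linalg.norm(T_hat))  # close to 7.5, the mean of the input norms 5 and 10
```

Because Step 9 renormalizes to the average input norm, the merged tensor's overall magnitude is set by $\bar{n}$, whatever scale Step 8 produced.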

Model size: 31B params · Tensor type: BF16