| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| T5-GenQ-T-v1 | 75.2151 | 54.8735 | 74.5142 | 74.5262 |
| T5-GenQ-TD-v1 | 78.2570 | 58.9586 | 77.5308 | 77.5466 |
| T5-GenQ-TDE-v1 | 76.9075 | 57.0980 | 76.1464 | 76.1502 |
| T5-GenQ-TDC-v1 (best) | 80.0754 | 61.5974 | 79.3557 | 79.3427 |

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|
| T5-GenQ-T-v1 | 73.11 | 52.27 | 72.51 | 72.51 |
| query-gen-msmarco-t5-base-v1 | 40.34 | 19.52 | 39.21 | 39.21 |

| Input Text | Target Query | Before Fine-tuning | After Fine-tuning |
|---|---|---|---|
| PANDORA Jewelry Crossover Pave Triple Band Ring for Women - Sterling Silver with Cubic Zirconia | PANDORA Crossover Triple Band Ring | what is pandora jewelry | Pandora crossover ring |
| SAYOYO Baby Sneakers Leather Baby Shoes Crib Shoes Toddler Soft Sole Sneakers | SAYOYO Baby Sneakers | what kind of shoes are baby sneakers | baby leather sneakers |
| 5 PCS Strap Replacement Compatible with Xiaomi Mi Band 3/4, Bands Xiaomi Mi Band 4 Smart Watch Wristbands Replacement Accessories Strap Bracelets for Mi Fit 3 Straps | Replacement Straps for Xiaomi Mi Band 3/4p | what is the strap on a xiaomi smartwatch | Xiaomi Mi Fit 3 replacement bands |
| Backpacker Ladies' Solid Flannel Shirt | ladies flannel shirt | what kind of shirt is a backpacker | women's flannel shirt |

| Epoch | Step | Loss | Grad Norm | Learning Rate | Eval Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum |
|---|---|---|---|---|---|---|---|---|---|
| 1.0 | 4285 | 0.9465 | 6.7834 | 4.9e-05 | 0.7644 | 73.1872 | 52.2019 | 72.5199 | 72.5183 |
| 2.0 | 8570 | 0.8076 | 4.9071 | 4.2e-05 | 0.7268 | 73.9182 | 53.1365 | 73.2551 | 73.2570 |
| 3.0 | 12855 | 0.7485 | 4.4814 | 3.5e-05 | 0.7160 | 74.4752 | 53.8076 | 73.7712 | 73.7792 |
| 4.0 | 17140 | 0.7082 | 5.3145 | 2.8e-05 | 0.7023 | 74.7628 | 54.3316 | 74.0811 | 74.0790 |
| 5.0 | 21425 | 0.6788 | 4.4266 | 2.1e-05 | 0.7013 | 74.9437 | 54.5630 | 74.2637 | 74.2668 |
| 6.0 | 25710 | 0.6561 | 5.2897 | 1.4e-05 | 0.6998 | 75.0834 | 54.7163 | 74.3907 | 74.3977 |
| 7.0 | 29995 | 0.6396 | 3.5197 | 7.0e-06 | 0.7005 | 75.2151 | 54.8735 | 74.5142 | 74.5262 |
| 8.0 | 34280 | 0.6278 | 4.4625 | 0.0 | 0.7016 | 75.1899 | 54.8423 | 74.4695 | 74.4801 |
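
The learning-rate column drops by 7.0e-06 per epoch and reaches 0.0 at step 34280, which is consistent with a linear decay to zero without warmup. Extrapolating back to step 0 gives an initial rate of 5.6e-05. That starting value and the schedule type are inferred from the logged numbers, not stated hyperparameters, so the sketch below is only a consistency check under that assumption.

```python
# Assumption (inferred, not documented): linear decay from 5.6e-05 to 0
# over 34280 optimizer steps, with no warmup phase.
INITIAL_LR = 5.6e-05  # hypothetical, extrapolated from the table
TOTAL_STEPS = 34280   # last step listed in the table


def linear_decay_lr(step: int,
                    initial_lr: float = INITIAL_LR,
                    total_steps: int = TOTAL_STEPS) -> float:
    """Learning rate after `step` optimizer steps under linear decay to zero."""
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / total_steps


# (step, logged learning rate) pairs copied from the training table
logged = [
    (4285, 4.9e-05), (8570, 4.2e-05), (12855, 3.5e-05), (17140, 2.8e-05),
    (21425, 2.1e-05), (25710, 1.4e-05), (29995, 7.0e-06), (34280, 0.0),
]

for step, lr in logged:
    # Print the assumed schedule next to the logged value for comparison
    print(step, f"{linear_decay_lr(step):.1e}", f"{lr:.1e}")
```

If the two printed columns agree at every epoch boundary, the linear-decay assumption fits the log. It does not rule out other schedules that happen to coincide at these steps.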

### Model Analysis

#### Average scores by model

T5-GenQ-T-v1 (checkpoint-29995) outperforms query-gen-msmarco-t5-base-v1 on every ROUGE metric. The absolute gap is about 33 points on each metric. The relative gap is largest for ROUGE-2, where T5-GenQ-T-v1 scores 52.27 against 19.52, roughly 2.7 times higher. ROUGE-1, ROUGE-L, and ROUGE-Lsum follow the same pattern: T5-GenQ-T-v1 scores above 72 on each, while query-gen-msmarco-t5-base-v1 stays below 41.
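
To make the ROUGE-1 and ROUGE-2 numbers more concrete, here is a minimal sketch of an n-gram F1 score. It assumes lowercasing and whitespace tokenization; the reported scores were presumably computed with a standard ROUGE package whose tokenization and stemming may differ, so this illustrates the idea and is not the evaluation code. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: clipped n-gram overlap, whitespace tokens."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # Counter & keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# First row of the examples table: target query and the two generated queries
target = "PANDORA Crossover Triple Band Ring"
before = "what is pandora jewelry"
after = "Pandora crossover ring"

for name, query in [("before", before), ("after", after)]:
    r1 = rouge_n_f1(target, query, n=1)
    r2 = rouge_n_f1(target, query, n=2)
    print(f"{name}: ROUGE-1={r1:.4f} ROUGE-2={r2:.4f}")
```

Under these simplifications the fine-tuned output scores 0.7500 on ROUGE-1 and 0.3333 on ROUGE-2 for this pair, while the question-style output scores 0.2222 and 0.0000. This is the same kind of gap the averages show.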

#### Density comparison

- `T5-GenQ-T-v1`: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.
- `query-gen-msmarco-t5-base-v1`: more spread-out distribution, with multiple peaks at 10-40%, suggesting greater variability but lower precision.
- ROUGE-1 and ROUGE-L: `T5-GenQ-T-v1` peaks at 100%, while `query-gen-msmarco-t5-base-v1` has lower, broader peaks.
- ROUGE-2: `query-gen-msmarco-t5-base-v1` has a high density at 0%, indicating many low-overlap outputs.

#### Histogram comparison

- `T5-GenQ-T-v1`: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.
- `query-gen-msmarco-t5-base-v1`: more spread-out distribution, with peaks in the 10-40% range, suggesting greater variability but lower precision.
- ROUGE-1 and ROUGE-L: `T5-GenQ-T-v1` shows a rising trend towards higher scores, while `query-gen-msmarco-t5-base-v1` has multiple peaks at lower scores.
- ROUGE-2: `query-gen-msmarco-t5-base-v1` has a high concentration of low-score outputs, whereas `T5-GenQ-T-v1` produces more high-scoring outputs.

#### Scores by generated query length

This plot shows average ROUGE scores and score differences by generated query length, measured in words.

- High scores for most lengths (3-9 words): ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum remain consistently high.
- Sharp spike at 2 words: a large positive score difference, suggesting strong alignment for very short queries.
- Stable differences for 3-9 words: after the spike at 2 words, score differences stay close to zero, indicating consistent performance across query lengths.
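
A per-length average of this kind can be computed by grouping scores on the word count of each generated query. The sketch below assumes whitespace word counting and uses made-up scores for illustration only; the function name is hypothetical and this is not the code behind the plot.

```python
from collections import defaultdict


def mean_score_by_length(queries, scores):
    """Average score per generated-query length, counted in whitespace words."""
    buckets = defaultdict(list)
    for query, score in zip(queries, scores):
        buckets[len(query.split())].append(score)
    # Sort by length so the result reads like the x-axis of the plot
    return {n: sum(vals) / len(vals) for n, vals in sorted(buckets.items())}


# Hypothetical ROUGE-1 scores, NOT taken from the evaluation
queries = [
    "flannel shirt",
    "baby leather sneakers",
    "women's flannel shirt",
    "pandora crossover ring",
]
scores = [100.0, 40.0, 80.0, 75.0]

print(mean_score_by_length(queries, scores))
```

With these toy inputs the 2-word bucket averages 100.0 and the 3-word bucket averages 65.0.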

#### Semantic similarity distribution

This histogram shows the distribution of cosine similarity scores, which measure the semantic similarity between paired texts. Most scores cluster near 1.0, indicating that most text pairs are highly similar. Frequency rises gradually with similarity and peaks sharply at 1.0. Low similarity scores (0.0-0.4) are rare, so dissimilar pairs are uncommon.
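
For reference, cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The sketch below uses toy vectors; the embedding model that produced the real vectors is not specified here, so the inputs are placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vector -> 0
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical): parallel, orthogonal, and in-between vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))            # 45 degrees apart
```

A score near 1.0 means the two embeddings point in almost the same direction, which is what the peak in the histogram reflects.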

#### Semantic similarity score against ROUGE scores

This scatter plot matrix compares semantic similarity (cosine similarity) with each ROUGE score. Higher similarity goes with higher ROUGE scores, indicating strong n-gram overlap in semantically similar texts. ROUGE-1 and ROUGE-L show the strongest correlation with similarity, while ROUGE-2 is more variable. There are also low-similarity outliers, where texts share words but differ semantically.
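
One way to put a number on the relationship in such a scatter plot is the Pearson correlation coefficient between similarity and a ROUGE metric. The sketch below is a plain-Python version with made-up values; it only shows how the comparison could be quantified and does not report a measured correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-example values, NOT taken from the evaluation
similarity = [0.35, 0.60, 0.80, 0.95, 1.00]
rouge_1 = [20.0, 50.0, 66.7, 85.7, 100.0]

print(f"Pearson r = {pearson(similarity, rouge_1):.3f}")
```

Running the same function once per ROUGE metric would give comparable coefficients, which is one way to check the observation that ROUGE-2 tracks similarity less tightly.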