jjp97 committed
Commit a5e4bd5 · verified · 1 Parent(s): 0bd9ed9

Initial upload: laal-embedding-v1 (Sentence-Transformers, instruction model)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,184 @@
+ ---
+ language:
+ - ko
+ - en
+ license: apache-2.0
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - text-embeddings
+ - retrieval
+ - mteb
+ - korean
+ - multilingual
+ - e5
+ pipeline_tag: sentence-similarity
+ ---
+
+ # laal-embedding-v1
+
+ **laal-embedding-v1** is a Sentence-Transformers embedding model fine-tuned from
+ **`intfloat/multilingual-e5-large-instruct`** for improved **retrieval-oriented semantic search**, with a focus on **Korean fire-safety and legal-domain text**.
+
+ * **Base model:** `intfloat/multilingual-e5-large-instruct`
+ * **Embedding dimension:** 1024
+ * **Similarity function:** cosine
+ * **Max sequence length:** 512 tokens
+ * **Architecture:** XLM-RoBERTa (24 layers)
+ * **HF repo:** [https://huggingface.co/jjp97/laal-embedding-v1](https://huggingface.co/jjp97/laal-embedding-v1)
+
+ > ⚠️ **Important**
+ > This model uses **fixed instruction prefixes** defined in `config_sentence_transformers.json`.
+ > **Always pass raw text to `encode()`**.
+ > Do **NOT** manually prepend instruction strings.
+
+ ---
+
+ ## Prompting (Important)
+
+ This model applies a different fixed prefix depending on the input type.
+
+ ### Query prefix
+
+ ```
+ Instruct: Given a web search query, retrieve relevant passages that answer the query.
+ Query:
+ ```
+
+ ### Passage prefix
+
+ ```
+ title: none
+ text:
+ ```
+
+ These prefixes are **automatically applied** by Sentence-Transformers via
+ `config_sentence_transformers.json`.
+
+ ### Correct usage ✅
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("jjp97/laal-embedding-v1")
+
+ # Query: "How to evacuate in case of fire"
+ q_emb = model.encode("화재 시 대피 방법")
+ # Passage: "If a fire breaks out, call 119 immediately and evacuate via a safe route."
+ p_emb = model.encode("화재가 발생하면 즉시 119에 신고하고 안전한 경로로 대피해야 한다.")
+ ```
+
+ ### Incorrect usage ❌ (double-prefixing)
+
+ ```python
+ # Do NOT do this
+ q = "Instruct: Given a web search query, retrieve relevant passages...\nQuery: 화재 시 대피 방법"
+ emb = model.encode(q)
+ ```
+
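+ ### Selecting a prompt explicitly (optional)
+
+ The prompts registered in `config_sentence_transformers.json` are named `"query"` and `"passage"`. If you want to be explicit about which prefix is applied, they can be selected by name with the standard Sentence-Transformers `prompt_name` argument. This is a minimal sketch of that alternative; the input strings still contain no manual prefix.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("jjp97/laal-embedding-v1")
+
+ # Apply the registered "query" prompt to search queries ...
+ q_emb = model.encode("화재 시 대피 방법", prompt_name="query")
+
+ # ... and the registered "passage" prompt to documents being indexed.
+ p_emb = model.encode(
+     "화재가 발생하면 즉시 119에 신고하고 안전한 경로로 대피해야 한다.",
+     prompt_name="passage",
+ )
+ ```
+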
+ ---
+
+ ## Training
+
+ ### Objective
+
+ * Contrastive learning (InfoNCE; a loss sketch follows this list)
+ * In-batch negatives
+ * Temperature (`tau`): **0.05**
+ * Regularization: **GOR (spread-out loss)**
+   * `gor_lambda = 0.001`
+   * `gor_max_samples = 64`
+
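+ The snippet below is a minimal PyTorch sketch of this objective, written from the hyperparameters above; it is **not** the exact training code, and the function name `info_nce_with_gor` is illustrative. `q` and `p` are L2-normalized query and positive-passage embeddings for one batch, the other in-batch passages act as negatives, and the GOR term follows the usual spread-out formulation (first and second moments of the off-diagonal similarities).
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def info_nce_with_gor(q, p, tau=0.05, gor_lambda=0.001):
+     """Illustrative InfoNCE + GOR loss; q, p: (batch, dim) L2-normalized embeddings."""
+     sim = q @ p.T                                   # cosine similarities, (batch, batch)
+     labels = torch.arange(q.size(0), device=q.device)
+
+     # InfoNCE with in-batch negatives: diagonal entries are the positives.
+     nce = F.cross_entropy(sim / tau, labels)
+
+     # GOR / spread-out regularizer on non-matching (off-diagonal) pairs:
+     # push their mean similarity toward 0 and their second moment toward 1/dim.
+     off_diag = sim[~torch.eye(q.size(0), dtype=torch.bool, device=q.device)]
+     gor = off_diag.mean() ** 2 + torch.clamp(off_diag.pow(2).mean() - 1.0 / q.size(1), min=0.0)
+
+     return nce + gor_lambda * gor
+ ```
+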
+ ### Data
+
+ * Training examples: **43,983**
+ * Format: (query, positive passage)
+ * Hard negatives: **enabled**
+   * `max_hn_per_example_train = 2`
+
+ > Training data consists of domain-specific Korean fire-safety and legal documents
+ > (private / curated dataset).
+
+ ### Hyperparameters (summary)
+
+ * Batch size: **512**
+ * Epochs: **3**
+ * Learning rate: **1e-5**
+ * Warmup ratio: **0.1**
+ * Approx. total steps: **255**
+
+ ---
+
+ ## Model Architecture
+
+ This model follows the standard Sentence-Transformers pipeline:
+
+ 1. **Transformer**: XLM-RoBERTa (24 layers, hidden size 1024)
+ 2. **Pooling**: mean pooling
+ 3. **Normalization**: L2 normalization (an equivalent computation is sketched below)
+
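+ As a rough illustration of what steps 2 and 3 compute, the snippet below reproduces the pipeline with plain `transformers` (the repo's `config.json` loads as an `XLMRobertaModel`): masked mean pooling over token embeddings, then L2 normalization. This is a sketch for understanding only; for actual use, load the model with `SentenceTransformer` as shown above so that the instruction prefixes are handled for you.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("jjp97/laal-embedding-v1")
+ model = AutoModel.from_pretrained("jjp97/laal-embedding-v1")
+
+ batch = tokenizer(["화재 시 대피 방법"], padding=True, truncation=True,
+                   max_length=512, return_tensors="pt")
+
+ with torch.no_grad():
+     token_embs = model(**batch).last_hidden_state   # (batch, seq_len, 1024)
+
+ # Mean pooling: average token embeddings, ignoring padding positions.
+ mask = batch["attention_mask"].unsqueeze(-1).float()
+ pooled = (token_embs * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
+
+ # L2 normalization, matching the 2_Normalize module.
+ emb = F.normalize(pooled, p=2, dim=1)               # (batch, 1024)
+ ```
+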
+ ---
+
+ ## Intended Use
+
+ * Retrieval and semantic search (RAG pipelines; see the retrieval sketch after this list)
+ * Domain-specific QA (fire safety, legal text)
+ * Embedding-based similarity and clustering (best performance in retrieval-style settings)
+
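+ The retrieval sketch below embeds a query and a few candidate passages, then ranks the passages by cosine similarity. The Korean passages are hypothetical examples, and `model.similarity` assumes sentence-transformers ≥ 3.0.
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("jjp97/laal-embedding-v1")
+
+ query = "소화기 사용 방법"  # "how to use a fire extinguisher"
+ passages = [
+     "소화기는 안전핀을 뽑고 불꽃 아래쪽을 향해 분사한다.",           # relevant
+     "건축물의 피난계단 설치 기준은 건축법 시행령에 규정되어 있다.",   # off-topic
+ ]
+
+ q_emb = model.encode(query)
+ p_embs = model.encode(passages)
+
+ # Cosine similarity, the model's configured similarity function.
+ scores = model.similarity(q_emb, p_embs)   # shape (1, len(passages))
+ best = int(scores.argmax())
+ print(scores, passages[best])
+ ```
+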
+ ---
+
+ ## Evaluation
+
+ ### Sanity check
+
+ In manual sanity checks, query–passage cosine similarity shows reasonable separation between relevant and irrelevant passages.
+
+ ### MTEB
+
+ This model is intended for evaluation on the **MTEB leaderboard**; a minimal evaluation sketch follows the list below.
+ When reporting results, please specify:
+
+ * model name: `jjp97/laal-embedding-v1`
+ * exact revision (commit hash)
+ * benchmark suite used
+
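+ One common way to produce such numbers is the `mteb` package (`pip install mteb`); the sketch below selects Korean retrieval tasks, but check the MTEB documentation for the exact task list of the suite you intend to report.
+
+ ```python
+ import mteb
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("jjp97/laal-embedding-v1")
+
+ # Korean retrieval tasks; adjust the filter to the benchmark suite you report.
+ tasks = mteb.get_tasks(task_types=["Retrieval"], languages=["kor"])
+ evaluation = mteb.MTEB(tasks=tasks)
+
+ results = evaluation.run(model, output_folder="results/jjp97__laal-embedding-v1")
+ ```
+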
+ ---
+
+ ## Limitations
+
+ * Performance may degrade if instruction prefixes are manually added (double-prefixing).
+ * Fine-tuned primarily for retrieval; performance on classification/STS tasks may vary.
+ * Domain bias toward Korean fire-safety / legal text.
+
+ ---
+
+ ## License
+
+ * Released under **Apache-2.0**, following the base model's license.
+
+ ---
+
+ ## Acknowledgements
+
+ * Base model: **Multilingual E5**
+   Liang Wang et al., *Multilingual E5 Text Embeddings: A Technical Report*, arXiv:2402.05672
+ * Sentence-Transformers library
+
+ ---
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{laal_embedding_v1_2025,
+   title        = {laal-embedding-v1},
+   author       = {Park, Jeongjae},
+   year         = {2025},
+   howpublished = {Hugging Face model card},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "architectures": [
+     "XLMRobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.56.1",
+   "type_vocab_size": 1,
+   "use_cache": false,
+   "vocab_size": 250002
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "__version__": {
+     "sentence_transformers": "5.1.0",
+     "transformers": "4.56.1",
+     "pytorch": "2.8.0+cu128"
+   },
+   "model_type": "SentenceTransformer",
+   "prompts": {
+     "query": "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: ",
+     "passage": "title: none\ntext: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5438aef6c99a7765c6126ced5a2902a95525b12bb2bf5ed2ee1008f515ea9476
+ size 2239607176
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
training_summary.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "model": "intfloat/multilingual-e5-large-instruct",
+   "train_examples": 43983,
+   "batch_size": 512,
+   "epochs": 3,
+   "learning_rate": 1e-05,
+   "total_steps_approx": 255,
+   "warmup_ratio": 0.1,
+   "tau": 0.05,
+   "gor_lambda": 0.001,
+   "gor_max_samples": 64,
+   "max_hn_per_example_train": 2
+ }