thnhan3 committed on
Commit ebbd10b · verified · 1 Parent(s): 23bd916

Update README.md

Files changed (1): README.md (+192 -3)
README.md CHANGED
---
license: mit
datasets:
- 8Opt/vietnamese-summarization-dataset-0001
language:
- vi
metrics:
- bertscore
- rouge
base_model:
- VietAI/vit5-base
pipeline_tag: summarization
library_name: transformers
---

# ViT5 Vietnamese Summarization

A fine-tuned ViT5 model for Vietnamese text summarization.

## Model Description

This model is a fine-tuned version of ViT5 on a Vietnamese summarization dataset. It generates abstractive summaries of Vietnamese documents.

- **Base Model:** VietAI/vit5-base
- **Task:** Abstractive Text Summarization
- **Language:** Vietnamese

## Training Configuration

- **max_input_length:** 1280 tokens
- **max_output_length:** 256 tokens
- **Training dataset:** [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001)
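
For reference, a minimal preprocessing sketch consistent with these lengths, assuming standard `transformers` seq2seq tokenization (the `document`/`summary` column names are assumptions, not the dataset's documented schema):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")
dataset = load_dataset("8Opt/vietnamese-summarization-dataset-0001")

def preprocess(batch):
    # Truncate source documents to the 1280-token input limit
    model_inputs = tokenizer(batch["document"], max_length=1280, truncation=True)
    # Truncate target summaries to the 256-token output limit
    labels = tokenizer(text_target=batch["summary"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)
```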

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the fine-tuned checkpoint
model_name = "thnhan3/sft_model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

document = """
Ngày 16 tháng 11 năm 2025, Chính phủ Việt Nam công bố kế hoạch phát triển kinh tế số
trong giai đoạn 2025-2030. Kế hoạch tập trung vào 3 trọng tâm chính: phát triển hạ tầng
số, đào tạo nguồn nhân lực công nghệ cao, và thúc đẩy chuyển đổi số doanh nghiệp.
Mục tiêu đặt ra là đến năm 2030, kinh tế số chiếm 30% GDP và tạo ra 2 triệu việc làm mới.
"""

# Tokenize, truncating to the 1280-token training limit
inputs = tokenizer(
    document,
    max_length=1280,
    truncation=True,
    return_tensors="pt"
)

# Generate the summary with beam search
outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(summary)
```
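
Alternatively, the high-level `pipeline` API can be used for quick experiments; a minimal sketch, reusing the `document` variable from above:

```python
from transformers import pipeline

# Summarization pipeline wrapping the same checkpoint
summarizer = pipeline("summarization", model="thnhan3/sft_model")

result = summarizer(
    document,
    max_new_tokens=256,
    num_beams=4,
    no_repeat_ngram_size=3,
)
print(result[0]["summary_text"])
```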

### Batch Processing

```python
import torch

documents = [
    "Văn bản 1...",
    "Văn bản 2...",
    "Văn bản 3...",
]

# Move the model to a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Tokenize the whole batch with padding
inputs = tokenizer(
    documents,
    max_length=1280,
    truncation=True,
    padding=True,
    return_tensors="pt"
).to(device)

outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    num_beams=4,
    length_penalty=1.0,
    early_stopping=True,
    no_repeat_ngram_size=3
)

summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for i, summary in enumerate(summaries):
    print(f"Summary {i+1}: {summary}")
```

### Optimized Inference with FP16

FP16 inference requires a CUDA GPU:

```python
import torch

device = torch.device("cuda")
model = model.to(device).half()  # cast weights to FP16

with torch.inference_mode():
    inputs = tokenizer(
        document,
        max_length=1280,
        truncation=True,
        return_tensors="pt"
    ).to(device)

    with torch.amp.autocast('cuda'):
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=256,
            num_beams=4,
            length_penalty=1.0,
            early_stopping=True,
            no_repeat_ngram_size=3
        )

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
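
If you prefer not to cast an already-loaded model, the checkpoint can also be loaded in half precision from the start via the standard `torch_dtype` argument; a short sketch:

```python
import torch
from transformers import AutoModelForSeq2SeqLM

# Load the weights directly in FP16 (GPU recommended)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "thnhan3/sft_model",
    torch_dtype=torch.float16,
).to("cuda")
```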

## Generation Parameters

Recommended parameters for best quality:

- `max_new_tokens`: 256 (matches the training configuration)
- `num_beams`: 4 (beam search for better quality)
- `length_penalty`: 1.0 (neutral length preference)
- `early_stopping`: True (stop once the EOS token is generated)
- `no_repeat_ngram_size`: 3 (avoids repetitive phrases)

You can adjust these parameters based on your needs (see the sketch after this list):

- Increase `num_beams` (5-8) for potentially better quality at the cost of slower generation
- Decrease `num_beams` (2-3) for faster generation with a slight quality trade-off
- Adjust `length_penalty` (0.8-1.2) to control summary length
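
For example, a speed-oriented configuration within the ranges above, reusing `model` and `inputs` from the batch example (illustrative values, not a tuned recommendation):

```python
# Fewer beams and a mild length penalty trade some quality for speed
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=256,
    num_beams=2,
    length_penalty=0.9,
    early_stopping=True,
    no_repeat_ngram_size=3
)
```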

## Model Performance

Evaluated on the test set of [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001):

*Coming soon.*
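
In the meantime, a minimal evaluation sketch using the `evaluate` library (requires `pip install evaluate rouge_score bert_score`; the `test` split and the `document`/`summary` column names are assumptions, so adjust them to the dataset's actual schema):

```python
import torch
import evaluate
from datasets import load_dataset

# Assumed split name; reuses `model` and `tokenizer` from the Usage section
test = load_dataset("8Opt/vietnamese-summarization-dataset-0001", split="test")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

predictions = []
for example in test:
    inputs = tokenizer(example["document"], max_length=1280, truncation=True,
                       return_tensors="pt").to(device)
    output = model.generate(inputs.input_ids, max_new_tokens=256, num_beams=4)
    predictions.append(tokenizer.decode(output[0], skip_special_tokens=True))

references = test["summary"]

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="vi"))
```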

## Limitations

- Maximum input length: 1280 tokens; longer documents are truncated (see the chunking sketch after this list).
- Trained on Vietnamese news and formal text, so quality may degrade on informal or conversational input.
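
One possible workaround for longer documents is to split them into token-sized chunks, summarize each chunk, and then summarize the concatenated partial summaries. This chunking strategy is an illustration, not part of the released model:

```python
def summarize(text: str) -> str:
    # Single-document summarization, as in Basic Usage above
    inputs = tokenizer(text, max_length=1280, truncation=True, return_tensors="pt")
    outputs = model.generate(inputs.input_ids, max_new_tokens=256, num_beams=4,
                             no_repeat_ngram_size=3)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def summarize_long(text: str, chunk_tokens: int = 1280) -> str:
    # Split on token boundaries so every chunk fits the model's input limit
    ids = tokenizer(text, add_special_tokens=False).input_ids
    chunks = [tokenizer.decode(ids[i:i + chunk_tokens])
              for i in range(0, len(ids), chunk_tokens)]
    partial = [summarize(chunk) for chunk in chunks]
    # Single chunk: return it directly; otherwise summarize the combined partials
    return partial[0] if len(partial) == 1 else summarize(" ".join(partial))
```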

## Citation

If you use this model, please cite:

```bibtex
@misc{vit5-vietnamese-summarization,
  author = {Tran Huu Nhan},
  title = {ViT5 Vietnamese Summarization},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thnhan3/sft_model}}
}
```

## License

MIT

## Acknowledgments

- Base model: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base)
- Training dataset: [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001)