+---
+license: mit
+datasets:
+- 8Opt/vietnamese-summarization-dataset-0001
+language:
+- vi
+metrics:
+- bertscore
+- rouge
+base_model:
+- VietAI/vit5-base
+pipeline_tag: summarization
+library_name: transformers
+---
+# ViT5 Vietnamese Summarization
+Fine-tuned ViT5 model for Vietnamese text summarization.
+## Model Description
+This model is a fine-tuned version of ViT5, trained on a Vietnamese summarization dataset. It generates unified extractive/abstractive summaries of Vietnamese documents.
+- **Base Model:** VietAI/vit5-base
+- **Task:** Abstractive Text Summarization
+- **Language:** Vietnamese
+## Training Configuration
+- **max_input_length:** 1280 tokens
+- **max_output_length:** 256 tokens
+- **Training dataset:** [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001)
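The two length limits above are typically applied when tokenizing training pairs. The following is a minimal sketch of such a preprocessing step, not the author's actual training script: the function name `preprocess` and the column names `document` and `summary` are assumptions, and `tokenizer` stands for any Hugging Face tokenizer, such as the one loaded in the Usage section.

```python
MAX_INPUT_LENGTH = 1280   # max_input_length from the configuration above
MAX_OUTPUT_LENGTH = 256   # max_output_length from the configuration above


def preprocess(batch, tokenizer):
    """Tokenize a batch of examples, truncating to the configured lengths.

    `batch` is a dict of lists with hypothetical "document" and "summary"
    columns; adapt the keys to the dataset's real column names.
    """
    # Source documents are cut off at the input limit.
    model_inputs = tokenizer(
        batch["document"],
        max_length=MAX_INPUT_LENGTH,
        truncation=True,
    )
    # Reference summaries are tokenized as targets and cut off at the output limit.
    labels = tokenizer(
        text_target=batch["summary"],
        max_length=MAX_OUTPUT_LENGTH,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With the `datasets` library, a function like this could be applied as `dataset.map(lambda b: preprocess(b, tokenizer), batched=True)`.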
+## Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Basic Usage
+```python
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+model_name = "thnhan3/sft_model"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+document = """
+Ngày 16 tháng 11 năm 2025, Chính phủ Việt Nam công bố kế hoạch phát triển kinh tế số
+trong giai đoạn 2025-2030. Kế hoạch tập trung vào 3 trọng tâm chính: phát triển hạ tầng
+số, đào tạo nguồn nhân lực công nghệ cao, và thúc đẩy chuyển đổi số doanh nghiệp.
+Mục tiêu đặt ra là đến năm 2030, kinh tế số chiếm 30% GDP và tạo ra 2 triệu việc làm mới.
+"""
+inputs = tokenizer(
+    document,
+    max_length=1280,
+    truncation=True,
+    return_tensors="pt"
+)
+outputs = model.generate(
+    inputs.input_ids,
+    attention_mask=inputs.attention_mask,
+    max_new_tokens=256,
+    num_beams=4,
+    length_penalty=1.0,
+    early_stopping=True,
+    no_repeat_ngram_size=3
+)
+summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(summary)
+```
+### Batch Processing
+```python
+import torch
+documents = [
+    "Văn bản 1...",
+    "Văn bản 2...",
+    "Văn bản 3...",
+]
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = model.to(device)
+inputs = tokenizer(
+    documents,
+    max_length=1280,
+    truncation=True,
+    padding=True,
+    return_tensors="pt"
+).to(device)
+outputs = model.generate(
+    inputs.input_ids,
+    attention_mask=inputs.attention_mask,
+    max_new_tokens=256,
+    num_beams=4,
+    length_penalty=1.0,
+    early_stopping=True,
+    no_repeat_ngram_size=3
+)
+summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+for i, summary in enumerate(summaries):
+    print(f"Summary {i+1}: {summary}")
+```
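The example above tokenizes every document in a single batch, which can exhaust memory for large collections. A minimal sketch of splitting the list into smaller batches follows; the helper name `iter_batches` and the batch size of 8 are illustrative assumptions, not settings from this model card.

```python
def iter_batches(items, batch_size=8):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Usage with the batch example above (kept as comments so the helper runs standalone):
# summaries = []
# for batch in iter_batches(documents, batch_size=8):
#     inputs = tokenizer(batch, max_length=1280, truncation=True,
#                        padding=True, return_tensors="pt").to(device)
#     outputs = model.generate(inputs.input_ids,
#                              attention_mask=inputs.attention_mask,
#                              max_new_tokens=256, num_beams=4)
#     summaries.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

A suitable batch size depends on available memory and on how long the documents are, so it is worth tuning per setup.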
+### Optimized Inference with FP16
+```python
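+import torch
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+use_fp16 = device.type == "cuda"  # half precision is only worthwhile on GPU
+model = model.to(device)
+if use_fp16:
+    model = model.half()
+with torch.inference_mode():
+    inputs = tokenizer(
+        document,
+        max_length=1280,
+        truncation=True,
+        return_tensors="pt"
+    ).to(device)
+    # Autocast is enabled only on GPU, so the same code also runs on CPU in FP32.
+    with torch.autocast(device_type=device.type, enabled=use_fp16):
+        outputs = model.generate(
+            inputs.input_ids,
+            attention_mask=inputs.attention_mask,
+            max_new_tokens=256,
+            num_beams=4,
+            length_penalty=1.0,
+            early_stopping=True,
+            no_repeat_ngram_size=3
+        )
+    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(summary)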
+```
+## Generation Parameters
+Recommended parameters for best quality:
+- `max_new_tokens`: 256 (matches training configuration)
+- `num_beams`: 4 (beam search for better quality)
+- `length_penalty`: 1.0 (neutral length preference)
+- `early_stopping`: True (stop beam search as soon as `num_beams` complete candidates are found)
+- `no_repeat_ngram_size`: 3 (avoid repetitive phrases)
+You can adjust these parameters based on your needs:
+- Increase `num_beams` (5-8) for potentially better quality but slower generation
+- Decrease `num_beams` (2-3) for faster generation with slight quality trade-off
+- Adjust `length_penalty` (0.8-1.2) to control summary length: higher values favor longer summaries, lower values favor shorter ones
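To make the effect of `length_penalty` concrete: beam search in `transformers` scores a finished hypothesis as the sum of its token log-probabilities divided by its length raised to `length_penalty`. The sketch below applies that formula to two made-up hypotheses. The log-probabilities and lengths are invented for illustration and do not come from this model, and the helper names are hypothetical.

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized beam score: sum of log-probs / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams: (sum of log-probabilities, length in tokens).
short_hyp = (-6.0, 10)
long_hyp = (-11.0, 20)


def preferred(length_penalty):
    """Return which hypothetical beam scores higher under the given penalty."""
    short_score = beam_score(*short_hyp, length_penalty)
    long_score = beam_score(*long_hyp, length_penalty)
    return "short" if short_score > long_score else "long"


for lp in (0.8, 1.0, 1.2):
    print(f"length_penalty={lp}: prefers the {preferred(lp)} hypothesis")
```

With these numbers, a penalty of 0.8 prefers the shorter hypothesis, while 1.0 and 1.2 prefer the longer one.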
+## Model Performance
+Evaluation results on the test set of [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001) are coming soon.
+## Limitations
+- Maximum input length: 1280 tokens. Longer documents will be truncated.
+- Trained on Vietnamese news/formal text, so quality may be lower on informal or out-of-domain text.
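One possible workaround for the 1280-token limit is to split a long document into token windows and summarize each window separately. This is an assumption about how the limit could be handled, not something the model was trained or evaluated for. The helper name `split_token_ids` is hypothetical, and the default of 1279 leaves room for the end-of-sequence token that T5-style tokenizers append.

```python
def split_token_ids(token_ids, max_tokens=1279):
    """Split a list of token ids into consecutive windows of at most `max_tokens`."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return [
        token_ids[start:start + max_tokens]
        for start in range(0, len(token_ids), max_tokens)
    ]


# Usage sketch (comments only; requires the tokenizer from the Usage section):
# ids = tokenizer(document, add_special_tokens=False).input_ids
# chunks = [tokenizer.decode(window) for window in split_token_ids(ids)]
# Each chunk can then be summarized as in the Basic Usage example.
```

Decoded text may re-tokenize to a slightly different length, so keeping `truncation=True` when summarizing each chunk remains a sensible safeguard. Windows cut at arbitrary token positions can also split sentences, which may affect summary quality.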
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{vit5-vietnamese-summarization,
+  author = {Tran Huu Nhan},
+  title = {ViT5 Vietnamese Summarization},
+  year = {2025},
+  publisher = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/thnhan3/sft_model}}
+}
+```
+## License
+MIT
+## Acknowledgments
+- Base model: [VietAI/vit5-base](https://huggingface.co/VietAI/vit5-base)
+- Training dataset: [8Opt/vietnamese-summarization-dataset-0001](https://huggingface.co/datasets/8Opt/vietnamese-summarization-dataset-0001)