---
library_name: transformers
pipeline_tag: text-generation
base_model:
- aisingapore/WangchanLION-v3
language:
- en
- th
license: llama3.1
base_model_relation: finetune
---
Current Version: `22.07.2025`
# WangchanLION-v3
WangchanLION is a joint effort between VISTEC and AI Singapore to develop a Thai-specific collection of Large Language Models (LLMs), pre-trained for Southeast Asian (SEA) languages and instruction-tuned specifically for the Thai language.
WangchanLION-v3 is a multilingual model that has been continually pre-trained on around **47.4 billion Thai samples** from [web](https://huggingface.co/datasets/aisingapore/WangchanLION-Web) and [non-web](https://huggingface.co/datasets/aisingapore/WangchanLION-Curated) data.
- **Developed by:** Products Pillar, AI Singapore, and VISTEC
- **Funded by:** Singapore NRF, PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited
- **Model type:** Decoder
- **Languages:** English, Thai
- **License:** [Llama 3.1 Community License](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE)
## Model Details
### Model Description
We created WangchanLION-v3 by performing continual pre-training in Thai on [Llama 3.1 8B CPT SEA-LIONv3 Instruct](https://huggingface.co/aisingapore/Llama-SEA-LION-v3-8B-IT), a decoder model based on the Llama 3 architecture.
For tokenization, the model employs the default Llama 3.1 8B tokenizer. The model has a context length of 128k tokens.
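As a rough illustration of working within the 128k context limit, a caller can clip overly long token sequences before generation. This is a hypothetical helper, not part of the released code; 128k is assumed here to mean 131,072 tokens, as in the Llama 3.1 configuration:

```python
# Assumed context length in tokens (128k as configured for Llama 3.1).
CONTEXT_LENGTH = 131072

def clip_to_context(token_ids, max_len=CONTEXT_LENGTH):
    """Keep only the most recent max_len tokens so the prompt fits
    inside the model's context window; shorter inputs pass through unchanged."""
    return token_ids[-max_len:] if len(token_ids) > max_len else token_ids
```

In practice you would tokenize the prompt first (e.g. with the model's tokenizer) and clip the resulting ID list before calling generate.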
### Usage
**NOTE:** This model has not been trained to use a system prompt or tool calling; use it only with plain user/assistant turns in the supervised fine-tuning (SFT) chat format.
WangchanLION-v3 can be run using the 🤗 Transformers library:
```python
# Please use transformers==4.45.2
import transformers
import torch

model_id = "aisingapore/WangchanLION-v3"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    # Thai: "Write me a poem."
    {"role": "user", "content": "แต่งกลอนให้หน่อย"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
```
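Because the pipeline returns the full conversation under `generated_text`, multi-turn chat is a matter of appending each assistant reply back onto the message list. The `run_turn` helper below is a hypothetical sketch of that bookkeeping, assuming `pipeline` is the text-generation pipeline created above:

```python
def run_turn(pipeline, history, user_text, max_new_tokens=256):
    """Append a user turn, generate, and record the assistant reply.

    Assumes the pipeline returns the whole conversation, including the new
    assistant message, under outputs[0]["generated_text"], as in the
    example above.
    """
    history.append({"role": "user", "content": user_text})
    outputs = pipeline(history, max_new_tokens=max_new_tokens)
    reply = outputs[0]["generated_text"][-1]  # last message is the assistant turn
    history.append(reply)
    return reply["content"]
```

Each call extends `history` in place, so passing the same list across calls keeps the conversation within one context window (subject to the 128k limit noted above).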
### Caveats
It is important for users to be aware that our model exhibits certain limitations that warrant consideration. Like many LLMs, the model can hallucinate and occasionally generates irrelevant content, introducing fictional elements that are not grounded in the provided context. Users should also exercise caution in interpreting and validating the model's responses due to the potential inconsistencies in its reasoning.
## Limitations
### Safety
Current SEA-LION models, including this commercially permissive release, have not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and code.
## Technical Specifications
### Pre-training Details
Please read our [research paper](https://arxiv.org/abs/2507.14664) to understand how we trained the model.
## Call for Contributions
We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-LION. Contributions can involve identifying and reporting bugs, sharing pre-training, instruction, and preference data, improving documentation usability, proposing and implementing new model evaluation tasks and metrics, or training versions of the model in additional Southeast Asian languages. Join us in shaping the future of SEA-LION by sharing your expertise and insights to make these models more accessible, accurate, and versatile. Please check out our GitHub for further information on the call for contributions.
## The Team
### AISG
Chan Adwin, Choa Esther, Cheng Nicholas, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Limkonchotiwat Peerat, Liu Bing Jie Darius, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin
### WangchanX
Can Udomcharoenchaikit, Chalermpun Mai-On, Chayapat Uthayopas, Ekapol Chuangsuwanich, Lalita Lowphansirikul, Nonthakit Chaiwong, Panuthep Tasawong, Patomporn Payoungkhamdee, Pume Tuchinda, Romrawin Chumpu, Sarana Nutanong, Wannaphong Phatthiyaphaibun
## Acknowledgements
[AI Singapore](https://aisingapore.org/) is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation or the National University of Singapore.
This release is part of WangchanX, a Large Language Model (LLM) research and development project supported by PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited. The project is a collaborative effort originated by PyThaiNLP and VISTEC-depa Thailand AI Research Institute, focusing on the development of Adaptation Toolsets, Instruction Tuning & Alignment Datasets, and Benchmarks.
## Contact
- Peerat Limkonchotiwat [email protected]
[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)
[Link to WangchanX FLAN-like Dataset Creation Github repository](https://github.com/vistec-AI/WangchanX/tree/datasets)
## Citation
```
@misc{phatthiyaphaibun2025mangosteenopenthaicorpus,
title={Mangosteen: An Open Thai Corpus for Language Model Pretraining},
author={Wannaphong Phatthiyaphaibun and Can Udomcharoenchaikit and Pakpoom Singkorapoom and Kunat Pipatanakul and Ekapol Chuangsuwanich and Peerat Limkonchotiwat and Sarana Nutanong},
year={2025},
eprint={2507.14664},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2507.14664},
}
```
## Disclaimer
This is the repository for the commercial instruction-tuned model.
The model has _not_ been aligned for safety.
Developers and users should perform their own safety fine-tuning and related security measures.
In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and code.
## Resources
- Pre-training data (web): https://huggingface.co/datasets/aisingapore/WangchanLION-Web
- Pre-training data (curated): https://huggingface.co/datasets/aisingapore/WangchanLION-Curated
- Pre-training model: https://huggingface.co/aisingapore/WangchanLION-v3
- SFT model: https://huggingface.co/aisingapore/WangchanLION-v3-IT
- Paper: https://arxiv.org/abs/2507.14664
- Blog: https://sea-lion.ai/sea-lion-wangchanlionv3/
- Github: https://github.com/vistec-AI/Mangosteen