# Marian MT fine-tuned on the Multilingual Corpus of World’s Constitutions (MCWC)

This model is a fine-tuned version of Helsinki-NLP/opus-mt-en-es, adapted using high-quality sentence-aligned constitutional text from the Multilingual Corpus of World’s Constitutions (MCWC).
📄 MCWC paper (OSACT 2024): https://aclanthology.org/2024.osact-1.7/
This variant handles: English → Spanish translation.
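A quick-start sketch using the Transformers `pipeline` API. It assumes `transformers` is installed with a suitable backend (TensorFlow or PyTorch) and that the model id below matches this repository:

```python
# Load the fine-tuned EN→ES model from the Hugging Face Hub and translate
# a short constitutional-style sentence. Downloads weights on first use.
from transformers import pipeline

translator = pipeline("translation", model="drelhaj/marian-finetuned-mcwc-en-to-es")

result = translator("All persons are equal before the law.")
print(result[0]["translation_text"])
```

The `pipeline` helper picks the installed framework automatically; for finer control over generation, the `MarianTokenizer` and Marian model classes can be used directly.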
## Overview
The MCWC provides a curated multilingual collection of constitutional texts from countries across the world. The corpus emphasises data cleanliness, high-quality sentence alignment, and detailed metadata (including country and continent mappings). It supports research in:
- legal and constitutional NLP
- comparative constitutional studies
- multilingual machine translation
- cross-lingual semantic analysis
This model was fine-tuned on the English–Spanish segment of the MCWC, enabling translation that is more attuned to legal and constitutional language than general-purpose MT systems.
## Intended use
This model is suitable for tasks such as:
- translating constitutional or legal documents
- cross-lingual legal text comparison
- multilingual information extraction
- downstream legal NLP tasks requiring domain-specific MT
It is not intended for casual or conversational translation, as it is optimised for formal and legal text.
## Training data
The model was trained on the MCWC’s English–Spanish aligned sentence pairs.
The MCWC dataset includes:
- cleaned constitutional text
- high-quality sentence segmentation
- pairwise alignments
- country and regional metadata
More details may be found in the accompanying paper:

El-Haj, M., & Ezzini, S. (2024). “The Multilingual Corpus of World’s Constitutions (MCWC).” In *Proceedings of OSACT @ LREC-COLING 2024*. https://aclanthology.org/2024.osact-1.7/
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: AdamWeightDecay
  - learning rate schedule: PolynomialDecay (initial_learning_rate = 5e-05, decay_steps = 3666, end_learning_rate = 0.0, power = 1.0, cycle = False)
  - beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, amsgrad = False
  - weight_decay_rate = 0.01
- training_precision: mixed_float16
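The learning-rate schedule above is a linear PolynomialDecay (power = 1.0) from 5e-05 down to 0 over 3,666 steps. A minimal pure-Python sketch of that schedule, re-implemented here for illustration rather than the Keras class itself:

```python
def polynomial_decay(step, initial_lr=5e-5, decay_steps=3666,
                     end_lr=0.0, power=1.0):
    """Learning rate at a given step under polynomial decay.

    With power = 1.0 this is a straight linear ramp from
    `initial_lr` at step 0 to `end_lr` at `decay_steps`,
    held constant afterwards.
    """
    step = min(step, decay_steps)          # clamp: no decay past the horizon
    remaining = 1.0 - step / decay_steps   # fraction of the ramp still left
    return (initial_lr - end_lr) * remaining ** power + end_lr
```

For example, halfway through training (step 1833) the learning rate has fallen to half its initial value, and from step 3666 onward it stays at 0.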
## Training results
| Train Loss | Validation Loss | Epoch |
|---|---|---|
| 1.1192 | 1.0032 | 0 |
| 0.9233 | 0.9729 | 1 |
| 0.8292 | 0.9642 | 2 |
## Framework versions
- Transformers 4.33.3
- TensorFlow 2.13.0
- Datasets 2.14.5
- Tokenizers 0.13.3
## Base model

`drelhaj/marian-finetuned-mcwc-en-to-es` is fine-tuned from `Helsinki-NLP/opus-mt-en-es`.